Abstract

SVM classifiers are very effective mining tools, but scalability for massive
datasets is still an open issue. The Frank-Wolfe (FW) method deals with
large-scale instances of the general problem of maximizing a concave function
on the unit simplex, and can be specialized to SVM training to obtain
algorithms with remarkable theoretical properties and competitive performance
in practice.

We present and analyze a variant of the FW method designed to ob- 
tain improved performance on large-scale SVM problems. The algorithm 
is based on a new way to perform away steps, a well-known strategy 
employed to accelerate the convergence of the basic FW method. We 
demonstrate that the method matches the guarantees in terms of conver- 
gence rate and number of iterations obtained by using classic away steps. 
In particular, the method enjoys a linear rate of convergence. 

On the practical side, we provide experiments on several classification
datasets, and evaluate the results using statistical tests. Experiments show
that our method is faster than the FW method and works well even in the cases
in which classic away steps slow down the algorithm. Furthermore, these
improvements are obtained without sacrificing the predictive accuracy of the
obtained SVM model.

1 Introduction 

Training non-linear SVMs on large datasets is challenging fVD. Effective
Interior Point Methods can be devised under some special circumstances, such as
kernels which admit low-rank factorizations [T^lllS]. However, these methods
are not suitable for large-scale problems in a general scenario, mainly due to
memory constraints: a general interior point method needs $O(m^2)$ memory and
$O(m^3)$ time for matrix inversions, and both are prohibitive even for
medium-scale problems. Among the traditional methods devised to cope with this
problem, Active Set methods [331 [H 1311 and Sequential Minimal Optimization
(SMO) [H [JT] are the best-known alternatives. Indeed, these are the algorithms
of choice in the widely known libraries SVMLight [25] and LIBSVM [8],
respectively. Stochastic Gradient Descent 01 [S] and online sub-gradient
methods [321 I3H] have lately gained popularity in the community as approximate
but efficient alternatives to the classic solutions.

Other effective methods to deal with large datasets have been recently de- 
vised by focusing on formulations which lead to a differentiable concave maxi- 
mization problem on the unit simplex S, 

$$ \text{maximize} \ g(\alpha) \quad \text{subject to} \quad \alpha \in S := \Big\{ \alpha \in \mathbb{R}^m : \sum_i \alpha_i = 1 , \ \alpha_i \ge 0 \Big\} . \qquad (1) $$

This class of models encompasses, in particular, the L2-loss SVMs (L2-SVMs)
for binary classification, regression and novelty detection [HIIH]. See
also [9] for a more exhaustive list of computational tasks fitting problem (1).

It has been noted by researchers in different fields that this problem can be
solved approximately by simple iterative schemes. In [37], for instance,
Yildirim presents two greedy algorithms for the task of approximating the
minimum enclosing ball (MEB) of a set of points. In [1], Ahipasaoglu et al.
propose similar methods to solve Minimum Volume Ellipsoid problems, introducing
several techniques to analyze the convergence of these algorithms. In [48],
Zhang studies similar techniques for convex approximation and estimation of
mixture models. All these methods are variants of a general approximation
procedure for maximizing a differentiable concave function on the simplex,
tracing back to Frank and Wolfe jiTTJ HSJ |TS] and recently analyzed by
Clarkson in [5]. In a nutshell, each iteration of the standard Frank-Wolfe
(hereafter FW) method moves the solution towards the feasible direction in
which the linearized objective function increases most rapidly. The procedure
is related to the idea of coreset, coined in the context of computational
geometry, and denoting a (typically small) subset of data which suffices to
obtain an approximation of the solution on the whole dataset up to a given
precision $\varepsilon$. Clarkson's framework unifies diverse results regarding
the existence of small coresets for geometrical and machine learning problems.
An extension of this framework, providing stronger and more general convergence
guarantees, has been recently studied in [53].

In the context of large-scale SVM training, the first work to specialize a
variant of the FW method to SVM training is probably due to Tsang et al. [41].
Borrowing a coreset-based algorithm from computational geometry [7], the
authors show that the number of training patterns and iterations required to
obtain an approximation to L2-SVMs up to any arbitrary precision $\varepsilon$
is bounded by $O(1/\varepsilon)$, independently of the size of the dataset. The
resulting training algorithm exhibits running times linear in the number of
examples and a memory complexity (number of support vectors) independent of the
number of examples. In addition, combining this procedure with certain sampling
techniques yields sub-linear time approximation algorithms [41, 13]. In
practice, the method is found to be competitive with most traditional SVM
software. A generalization of this method, which can be applied to a wider
class of models, is presented in [42].

Several papers have recently stressed the efficiency of FW and coreset-based
methods in machine learning. In [13] and [15] the authors investigate the
direct application of the FW method to large-scale SVM training, demonstrating
that the running times of [41] can be significantly improved as long as a minor
loss in accuracy is acceptable. Variations of the algorithm based on
geometrical reformulations of the learning problem [26] IE], stochastic
variants of the method [25], and applications to SVM training on data streams
[HI [ST] and structural SVMs [2Z] have also been proposed.

In this paper, we introduce a modification of the FW method to deal with
large-scale instances of problem (1). The presentation and the analysis of the
method are deliberately general, so that the technique can be employed to solve
a number of problems to which the FW method has found applicability.
Nevertheless, our research is mainly motivated by SVMs. We thus illustrate our
ideas and focus our experimental analysis on the L2-SVM for binary
classification. Given a labelled set of examples
$\{(x_i, y_i) : i \in \mathcal{I}\}$, this learning task is formulated as the
solution to the following problem

$$ \text{maximize} \ g(\alpha) = -\alpha^T K \alpha \quad \text{subject to} \quad \sum_i \alpha_i = 1 , \ \alpha_i \ge 0 , \qquad (2) $$

where $K_{ij} = y_i y_j k(x_i, x_j) + y_i y_j + \delta_{ij}/C$, $k(x_i, x_j)$
is the kernel function used in the SVM model and $C$ is the regularization
parameter [411 1131 HI]. Problem (2) clearly fits problem (1). This formulation
is preferred mainly because of efficiency: by adopting the functional of
Eqn. (2), it is possible to exploit the framework introduced in [41], and
further developed in [9], to solve the learning problem more easily. Note also
that in problem (2) $K$ is positive definite^1 and thus $g(\cdot)$ is strictly
concave.
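
As a concrete illustration of problem (2), the following minimal Python sketch builds the matrix $K$ for a toy dataset and evaluates the objective and its gradient. It assumes the sign convention $g(\alpha) = -\alpha^T K \alpha$ (so that $g$ is concave); the Gaussian kernel, its width `gamma`, the toy data and all function names are illustrative assumptions, not part of the paper.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian kernel k(x, z) = exp(-gamma * ||x - z||^2) (an assumed choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def l2svm_matrix(X, y, C, kernel=rbf_kernel):
    """K_ij = y_i y_j k(x_i, x_j) + y_i y_j + delta_ij / C."""
    m = len(X)
    return [[y[i] * y[j] * kernel(X[i], X[j]) + y[i] * y[j]
             + (1.0 / C if i == j else 0.0)
             for j in range(m)] for i in range(m)]

def gradient(K, alpha):
    """Gradient of g(alpha) = -alpha^T K alpha, i.e. -2 K alpha (K symmetric)."""
    return [-2.0 * sum(kij * aj for kij, aj in zip(row, alpha)) for row in K]

def objective(K, alpha):
    """g(alpha) = -alpha^T K alpha, computed as 0.5 * alpha^T grad g(alpha)."""
    return 0.5 * sum(a * g for a, g in zip(alpha, gradient(K, alpha)))

# Toy data: four points in the plane with labels +1 / -1.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
y = [1, 1, -1, -1]
K = l2svm_matrix(X, y, C=10.0)
alpha = [0.25, 0.25, 0.25, 0.25]   # barycenter of the unit simplex
print(objective(K, alpha), gradient(K, alpha))
```

In a large-scale setting $K$ would not be stored explicitly; entries would be computed on demand from the kernel, which is what makes the per-iteration cost of FW-type methods attractive.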

Our method is obtained by incorporating a new type of away step into the
standard FW method. Loosely speaking, instead of moving the solution towards a
direction in which the linearized objective function increases, an away step
moves the solution away from a direction in which the linearized objective
function decreases. This strategy was explored by Wolfe in [45] in order to
improve the convergence rate of the FW method, leading to a variant of the
original algorithm called Modified Frank-Wolfe method (hereafter MFW). It has
been demonstrated that MFW is linearly convergent to the optimal value of the
objective function under rather weak assumptions on the form of the objective,
a property which cannot be guaranteed by the standard FW method in general
[1, 18, 47]. However, we have found in [14] that classic away steps do



^1 Note that $K$ can be written as
$K = yy^T \circ \mathbf{K} + yy^T + \frac{1}{C} I$, where $y$ is the column
vector whose components are the labels $y_i$, $\mathbf{K}$ is the Gram matrix
$\mathbf{K}_{ij} = k(x_i, x_j)$ and $\circ$ is the Hadamard or componentwise
product. The matrix $yy^T$ is clearly positive semi-definite since
$\alpha^T yy^T \alpha = \|\alpha^T y\|^2$. Positive semi-definiteness of
$\mathbf{K}$ is guaranteed by definition if we assume a Mercer kernel is used
in the SVM model I35| . Finally, positive semi-definiteness is preserved by the
Hadamard product and by sums, and adding $\frac{1}{C} I$ makes $K$ positive
definite.



not significantly improve the running times of the FW method on large SVM
problems. A similar conclusion is obtained by Ouyang et al. in [25].

Our contributions are twofold: on the theoretical side, we introduce a new
way to perform away steps in the FW method, demonstrating that the resulting
algorithm enjoys the same theoretical properties as MFW. In particular,
we demonstrate that the method converges linearly to the optimal value of the 
objective function, and provide a bound on the number of iterations needed to 
achieve a predetermined accuracy. On the practical side, we perform detailed 
experiments on several SVM problems, concluding that our algorithm improves 
the running times of existing FW approaches without any significant difference 
in terms of prediction accuracy. In particular, we show that the method is 
faster than or equal to the FW method when MFW is significantly slower and 
it is competitive with MFW when FW is significantly slower. The method thus 
represents a robust practical alternative to traditional FW schemes with strong 
theoretical guarantees. 

The paper is organized as follows. In Section 2 we give an overview of 
sparse approximation techniques based on the Frank-Wolfe algorithm, while
introducing some key concepts for further analysis of the method. In Section 3 
we introduce our method, giving some details about its specialization to SVM 
training. In Section 4, we present the main theoretical results regarding our 
method. Experimental results are presented in Section 5. Finally, Section 6 
closes the paper with some concluding remarks. In addition, a list of some 
technical results used throughout the paper is reported in the Appendix. 

1.1 Some notation and Terminology 

An optimal solution for problem (1) is denoted $\alpha^*$. A sequence of
approximations $\alpha_0, \alpha_1, \ldots, \alpha_k$ to a solution of
problem (1) is abbreviated $\{\alpha_k\}_k$. The set of indexes
$1, 2, \ldots, m$ is denoted $[m]$. The face $S_{\mathcal{I}}$ of the unit
simplex $S$ corresponding to a set of indexes $\mathcal{I} \subseteq [m]$ is
the subset of points $\alpha \in S$ such that
$\alpha_j = 0 \ \forall j \notin \mathcal{I}$. The term active face indicates
the face corresponding to the non-zero indexes, $\mathcal{I}_k$, of the current
solution $\alpha_k$. The term optimal face, denoted by $S^*$, indicates the
face corresponding to an optimal solution $\alpha^*$. The vector $e_i$ denotes
the $i$-th vector of the canonical basis.



2 Frank-Wolfe Methods

The FW method computes a sequence of approximations $\{\alpha_k\}_k$ to a
solution of problem (1) by iterating until convergence the following steps.
First, a linear approximation of $g(\cdot)$ at the current iterate $\alpha_k$
is performed in order to find an ascent direction

$$ u_k = \arg\max_{u \in S} \psi_k(u) := g(\alpha_k) + (u - \alpha_k)^T \nabla g(\alpha_k) . \qquad (3) $$

Since $u_k$ lies in $S$ and $\psi_k$ is linear, it is easy to see that the
linear approximation step reduces to $u_k = e_{i^*}$, where $i^*$ is the index
of the largest coordinate of the gradient, i.e.
$i^* = \arg\max_i \nabla g(\alpha_k)_i$. The iterate $\alpha_k$ is then moved
towards $e_{i^*}$, seeking the best feasible improvement of the objective
function. The procedure is summarized in Algorithm 1.

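Algorithm 1: FW Method for problem (1).

Compute an initial estimate $\alpha_0$.
Set $\mathcal{I}_0 = \{i : \alpha_{0,i} \neq 0\}$.
for $k = 0, 1, \ldots$ do
    Search for $i^* = \arg\max_i \nabla g(\alpha_k)_i$ and define $d_k^{FW} = e_{i^*} - \alpha_k$.
    Perform a line-search $\lambda_k = \arg\max_{\lambda \in [0,1]} g(\alpha_k + \lambda d_k^{FW})$.
    Update the iterate by $\alpha_{k+1} = \alpha_k + \lambda_k d_k^{FW} = (1 - \lambda_k)\alpha_k + \lambda_k e_{i^*}$.
    Set $\mathcal{I}_{k+1} = \mathcal{I}_k \cup \{i^*\}$.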



As discussed below, the procedure can be stopped when $g(\alpha_k)$ is "close
enough" to the optimum.
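
The steps of Algorithm 1 can be sketched in a few lines of Python for a quadratic objective. This is only an illustration under stated assumptions: the objective is taken to be $g(\alpha) = -\alpha^T K \alpha$ with a small dense, symmetric positive definite $K$, for which the line-search has the closed form $\lambda = d^T \nabla g(\alpha) / (2 d^T K d)$ clipped to $[0, 1]$. The function names, the fixed iteration count and the toy matrix are hypothetical choices made for the example.

```python
def matvec(K, v):
    """Matrix-vector product K v for a dense matrix stored as a list of rows."""
    return [sum(kij * vj for kij, vj in zip(row, v)) for row in K]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def frank_wolfe(K, alpha0, n_iter):
    """Run n_iter FW iterations for g(alpha) = -alpha^T K alpha on the simplex."""
    alpha = list(alpha0)
    for _ in range(n_iter):
        grad = [-2.0 * v for v in matvec(K, alpha)]             # gradient of g
        i_star = max(range(len(alpha)), key=grad.__getitem__)   # ascent vertex
        d = [-a for a in alpha]
        d[i_star] += 1.0                                        # d = e_{i*} - alpha
        slope = dot(d, grad)                                    # d^T grad g(alpha)
        curv = dot(d, matvec(K, d))                             # d^T K d
        # g(alpha + lam d) = g(alpha) + lam * slope - lam^2 * curv on [0, 1]
        lam = max(0.0, min(1.0, slope / (2.0 * curv))) if curv > 0.0 else 0.0
        alpha = [a + lam * di for a, di in zip(alpha, d)]
    return alpha

# Toy problem: for diagonal K the maximizer is proportional to 1 / K_ii,
# here alpha* = (4/7, 2/7, 1/7).
K = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 4.0]]
alpha = frank_wolfe(K, [1.0, 0.0, 0.0], n_iter=5000)
print(alpha)
```

Starting from a vertex, each iteration adds at most one new non-zero coordinate, so the iterate stays sparse in the early iterations.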

2.1 Optimality Measures and Stopping Condition 

It can be shown that the FW method is globally convergent under rather weak
assumptions on the form of the objective function [HI [17], which are
guaranteed to hold for the SVM problem (2) [HHH]. In addition, it can be shown
that the iterates of this procedure satisfy

$$ \Delta^p(\alpha_k) := g(\alpha^*) - g(\alpha_k) \le \frac{4 C_g}{k + 3} , \qquad (4) $$

where $C_g$ is a constant related to the second derivative of $g$ [S]. This
convergence rate is rather slow compared to other methods. However, the
simplicity of the procedure implies that the amount of computation per
iteration is usually very small. This kind of tradeoff can be favorable for
large-scale applications, as testified for example by the widespread adoption
of the SMO method in the context of SVMs [21 HI].

When $g(\alpha)$ is continuously differentiable, the Wolfe dual of
problem (1) is

$$ \min_{\alpha} w(\alpha) , \quad \text{with} \quad w(\alpha) = g(\alpha) + \max_i \nabla g(\alpha)_i - \alpha^T \nabla g(\alpha) . \qquad (5) $$

As shown in [9], the strong duality condition

$$ g(\alpha) \le g(\alpha^*) = w(\alpha^*) \le w(\alpha) \qquad (6) $$

holds for any feasible $\alpha$. Thus, another reasonable measure of optimality
for the Frank-Wolfe iterates is the so-called primal-dual gap

$$ \Delta^d(\alpha) := w(\alpha) - g(\alpha) = \max_i \nabla g(\alpha)_i - \alpha^T \nabla g(\alpha) . \qquad (7) $$

Up to a multiplicative constant ($4 C_g$), the primal-dual gap in Eqn. (7) and
the primal measure of approximation in Eqn. (4) are the metrics employed in [5]
to analyze the convergence of Algorithm 1. The advantage of
$\Delta^d(\alpha_k)$ with respect to $\Delta^p(\alpha_k)$ is that the former
does not depend on the optimal value of the objective function. Therefore,
$\Delta^d(\alpha_k)$ can be explicitly monitored during the execution of the
algorithm and can be adopted to implement a stopping condition for
Algorithm 1. In this paper, we adopt this measure to stop the FW method and any
of its variants. That is, the algorithms are terminated when

$$ \Delta^d(\alpha_k) = \max_i \nabla g(\alpha_k)_i - \alpha_k^T \nabla g(\alpha_k) \le \varepsilon , \qquad (8) $$

where $\varepsilon > 0$ is a given tolerance parameter. Note that the strong
duality condition implies $\Delta^p(\alpha_k) \le \Delta^d(\alpha_k)$.
Therefore, if the algorithm stops at iteration $k$ we also have
$\Delta^p(\alpha_k) \le \varepsilon$.
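
The stopping rule can be checked numerically on a small instance. The sketch below assumes the quadratic objective $g(\alpha) = -\alpha^T K \alpha$ and a toy diagonal $K$ whose maximizer is known in closed form; the helper names are illustrative. It shows that the primal-dual gap vanishes at the optimum and upper-bounds the primal error elsewhere.

```python
def gradient(K, alpha):
    """Gradient of g(alpha) = -alpha^T K alpha, i.e. -2 K alpha."""
    return [-2.0 * sum(kij * aj for kij, aj in zip(row, alpha)) for row in K]

def objective(K, alpha):
    """g(alpha) = -alpha^T K alpha."""
    return 0.5 * sum(a * g for a, g in zip(alpha, gradient(K, alpha)))

def duality_gap(K, alpha):
    """Primal-dual gap: max_i grad_i - alpha^T grad, computable without alpha*."""
    grad = gradient(K, alpha)
    return max(grad) - sum(a * g for a, g in zip(alpha, grad))

K = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 4.0]]
optimum = [4.0 / 7.0, 2.0 / 7.0, 1.0 / 7.0]   # proportional to 1 / K_ii
uniform = [1.0 / 3.0] * 3

gap_uniform = duality_gap(K, uniform)                          # 8/9
primal_error = objective(K, optimum) - objective(K, uniform)   # 13/63
print(gap_uniform, primal_error, duality_gap(K, optimum))
```

Only the gradient at the current iterate is needed, which the FW iteration computes anyway, so monitoring the gap adds essentially no cost.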

Note also that Eqn. (4) implies that the FW method finds a solution fulfilling
$\Delta^p(\alpha_k) \le \varepsilon$ in at most $K = O(1/\varepsilon)$
iterations. Clarkson has recently shown that we also have
$\Delta^d(\alpha) \le \varepsilon$ after at most $K = O(1/\varepsilon)$
iterations [1]. Thus, the solution found by the FW method using the stopping
condition (8) is guaranteed to be "close" to the optimum both primally and
dually after $O(1/\varepsilon)$ iterations.

In the analysis presented in this paper, we make use of the following notion
of approximation quality introduced in |ll] .

Definition 1. A feasible solution $\alpha$ to problem (1) is said to be a
$\Delta$-approximate solution if

$$ \Delta^d(\alpha) \le \Delta \qquad (9) $$

$$ \text{and} \quad \Delta_i^d(\alpha) := \nabla g(\alpha)_i - \alpha^T \nabla g(\alpha) \ge -\Delta \quad \forall i : \alpha_i > 0 . \qquad (10) $$

The first condition guarantees that a $\Delta$-approximate solution is "close"
to the optimum both primally and dually. In addition, the second condition
ensures that $-\Delta \le \Delta_i^d(\alpha) \le \Delta$ for the active face,
that is, the primal-dual gap computed on each active coordinate
$i : \alpha_i > 0$ is not far from the largest gap computed among all the
coordinates of the gradient. This also implies that the solution $\alpha$ is
"almost" optimal in the face of the simplex defined by the non-zero indexes.

2.2 Sparsity of the FW solutions and Coresets 

One of the main points of interest for the FW method is the sparsity of the
solutions it finds. It should be observed that Algorithm 1 adds at most one new
non-zero coordinate to the previous iterate at each step. If the starting
solution has $K_0$ non-zero coordinates, iterate $\alpha_k$ has at most
$K_0 + k$ non-zero entries. Therefore, our previous remarks about the
convergence of the FW method show that there exist solutions with
space-complexity $K_0 + O(1/\varepsilon)$ that are good approximations for
problem (1), even if $m$ (the dimensionality of the feasible space and the
number of data points in SVM problems) is much larger.

Existence of sparse approximate solutions for problem (1) can be linked to the
idea of $\varepsilon$-coreset, first described for the MEB and other
geometrical problems [Tf] . For $\varepsilon > 0$, an $\varepsilon$-coreset
$P' \subset P$ has the property that if the smallest ball containing $P'$ is
expanded by a factor of $1 + \varepsilon$, then the resulting ball contains
$P$. That is, if the problem is solved on $P'$, the solution is "close" to the
solution on $P$. The existence of $\varepsilon$-coresets of size
$O(1/\varepsilon)$ for the MEB problem was first demonstrated by Badoiu and
Clarkson in [6, 7]. Note that in large-scale applications, $1/\varepsilon$ can
be much smaller than the cardinality of $P$.

In [5], Clarkson provides a definition of coreset that applies in the general
setting of problem (1). In a nutshell, an $\varepsilon$-coreset for problem (1)
is a subset of indexes spanning a face of $S$ on which we can compute a good
approximate solution. Existence of small $\varepsilon$-coresets implies the
existence of sparse solutions which are optimal in their respective active
faces. The practical consequence of this result is the possibility of solving
large instances of (1) working with a small set of variables of the original
problem.

Definition 2. An $\varepsilon$-coreset for problem (1) is a set of indexes
$\mathcal{I} \subseteq [m]$ such that the solution $\alpha_{\mathcal{I}}$ to
the reduced problem

$$ \text{maximize} \ g(\alpha) \quad \text{subject to} \quad \alpha \in S_{\mathcal{I}} := \{ \alpha \in S : \alpha_i = 0 , \ \forall i \notin \mathcal{I} \} \qquad (11) $$

satisfies $\Delta^d(\alpha_{\mathcal{I}}) \le \varepsilon$.

As argued in [3], the FW method is not guaranteed to find an
$\varepsilon$-coreset after $O(1/\varepsilon)$ iterations. Instead,
$O(1/\varepsilon^2)$ iterations may be required. However, the computationally
intensive variant of the FW procedure presented in Algorithm 2 does the job.



Algorithm 2: Clarkson's method to find coresets for problem (1).

Compute an initial estimate $\alpha_0$.
Set $\mathcal{I}_0 = \{i : \alpha_{0,i} \neq 0\}$.
for $k = 0, 1, \ldots$ do
    Search for $i^* = \arg\max_i \nabla g(\alpha_k)_i$.
    Set $\mathcal{I}_{k+1} = \mathcal{I}_k \cup \{i^*\}$.
    Solve the reduced problem (11) with $\mathcal{I} = \mathcal{I}_{k+1}$.



Note that Algorithm 2 needs to solve an optimization problem of increasing
size at each iteration. This can be considered a generalized version of the
well-known Badoiu-Clarkson (BC) method to compute MEBs in computational
geometry and, to the best of our knowledge, corresponds to the first variant of
the FW method applied to SVM problems [41].

2.3 Boosting the Convergence using Away-steps 

It is well-known that the FW method often exhibits a tendency to stagnate near
the solution $\alpha^*$, resulting in a slow convergence rate [18]. As
discussed in [37] HI], this problem can be explained geometrically. Near the
solution, the gradient at $\alpha_k$ has a tendency to become nearly orthogonal
to the face of the simplex spanned by $\mathcal{I}_k$ (the non-zero coordinates
of $\alpha_k$). Therefore, very little improvement can be achieved by moving
$\alpha_k$ towards the ascent vertex $u_k$. However, since the solution is not
optimal, it is reasonable to think that the solution can be improved working on
the face spanned by $\mathcal{I}_k$. Actually, Algorithm 2 works on
$\mathcal{I}_k$ until approximate optimality before exploring the next ascent
direction.

It can be shown that the convergence of the FW method can be boosted by
introducing a new type of optimization step. In short, the idea is that,
instead of moving towards the direction maximizing the local linear
approximation $\psi_k(\cdot)$ of $g(\cdot)$, we can move away from the vertex
$v_k$ of the current face minimizing $\psi_k(\cdot)$. At each iteration, a
choice between these two options can be made by estimating the best possible
improvement in the objective function.

Since the descent vertex $v_k$ must lie in the current active face, it is easy
to see that the linear approximation step reduces to $v_k = e_{j^*}$, where
$j^*$ is the index of the smallest active coordinate of the gradient, i.e.,
$j^* = \arg\min_{j \in \mathcal{I}_k} \nabla g(\alpha_k)_j$. The whole
procedure, known as the Modified Frank-Wolfe algorithm (MFW), is summarized in
Algorithm 3.

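Algorithm 3: MFW Method for problem (1).

Compute an initial estimate $\alpha_0$.
Set $\mathcal{I}_0 = \{i : \alpha_{0,i} \neq 0\}$.
for $k = 0, 1, \ldots$ do
    Search for $i^* = \arg\max_i \nabla g(\alpha_k)_i$ and define $d_k^{FW} = e_{i^*} - \alpha_k$.
    Search for $j^* = \arg\min_{j \in \mathcal{I}_k} \nabla g(\alpha_k)_j$ and define $d_k^{A} = \alpha_k - e_{j^*}$.
    if $\nabla g(\alpha_k)^T d_k^{FW} \ge \nabla g(\alpha_k)^T d_k^{A}$ then
        Perform a line-search $\lambda_{fw} = \arg\max_{\lambda \in [0,1]} g(\alpha_k + \lambda d_k^{FW})$.
        Perform the FW step $\alpha_{k+1} = \alpha_k + \lambda_{fw}(e_{i^*} - \alpha_k)$.
        Update $\mathcal{I}_k$ by $\mathcal{I}_{k+1} = \mathcal{I}_k \cup \{i^*\}$.
    else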

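        Perform a line-search $\lambda_{away} = \arg\max_{\lambda \in [0,1]} g(\alpha_k + \lambda d_k^{A})$.
        Clip the line-search parameter, $\lambda_{away}^* = \min(\lambda_{away}, \alpha_{k,j^*} / (1 - \alpha_{k,j^*}))$.
        Perform the AWAY step $\alpha_{k+1} = \alpha_k + \lambda_{away}^* (\alpha_k - e_{j^*})$.
        Update $\mathcal{I}_k$ by $\mathcal{I}_{k+1} = \mathcal{I}_k$.
        If $\lambda_{away}^* = \alpha_{k,j^*} / (1 - \alpha_{k,j^*})$, $\mathcal{I}_{k+1} = \mathcal{I}_{k+1} \setminus \{j^*\}$.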



In contrast to the FW method, for which only a sub-linear rate of convergence
can be expected in general [HI] [47], it has been shown that the MFW method
asymptotically exhibits linear convergence to the solution of problem (1) under
rather weak assumptions on the form of the objective function [T5J|^[T]. In
addition, the MFW algorithm has the potential to compute sparser solutions in
practice since, in contrast to the FW method, it can drive coordinates of
$\alpha_k$ to zero at each step.
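
A single MFW iteration can be sketched as follows for the quadratic objective $g(\alpha) = -\alpha^T K \alpha$. The sketch assumes the standard form of the away step, $\alpha + \lambda(\alpha - e_{j^*})$, whose largest feasible step is $\alpha_{j^*}/(1 - \alpha_{j^*})$; the function names and the toy matrix are hypothetical. The example is chosen so that the optimum $(1/2, 1/2, 0)$ lies on a face of the simplex and one away step removes the spurious third vertex.

```python
def quad(K, d):
    """d^T K d for a dense symmetric matrix K."""
    return sum(di * sum(kij * dj for kij, dj in zip(row, d)) for di, row in zip(d, K))

def mfw_step(K, alpha):
    """One MFW iteration; returns (new_alpha, 'toward' or 'away')."""
    m = len(alpha)
    grad = [-2.0 * sum(kij * aj for kij, aj in zip(row, alpha)) for row in K]
    a_grad = sum(a * g for a, g in zip(alpha, grad))
    i_star = max(range(m), key=grad.__getitem__)
    j_star = min((j for j in range(m) if alpha[j] > 0.0), key=grad.__getitem__)
    slope_fw = grad[i_star] - a_grad      # grad^T (e_{i*} - alpha)
    slope_away = a_grad - grad[j_star]    # grad^T (alpha - e_{j*})
    if slope_fw >= slope_away:
        kind, slope, lam_max = "toward", slope_fw, 1.0
        d = [-a for a in alpha]
        d[i_star] += 1.0
    else:
        kind, slope = "away", slope_away
        d = list(alpha)
        d[j_star] -= 1.0
        lam_max = alpha[j_star] / (1.0 - alpha[j_star])   # keeps alpha_{j*} >= 0
    curv = quad(K, d)
    # Exact line-search for the quadratic, clipped to the feasible range.
    lam = lam_max if curv <= 0.0 else min(lam_max, slope / (2.0 * curv))
    new_alpha = [a + lam * di for a, di in zip(alpha, d)]
    if kind == "away" and lam == lam_max:
        new_alpha[j_star] = 0.0           # vertex j* leaves the active face
    return new_alpha, kind

K = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 4.0]]
new_alpha, kind = mfw_step(K, [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0])
print(kind, new_alpha)
```

Note how the away step rescales the two remaining weights from 1/3 to 1/2: every active coordinate is touched, which is the behaviour discussed in Section 3.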



2.4 Adaptations to SVMs 

In the context of SVM learning, the work of Tsang et al. in [41] was arguably
the first to point out the properties of the algorithms that can be obtained by
applying FW methods to formulations fitting problem (1). Their work relies on
the equivalence between the SVM problem (2) and a MEB problem, which holds
under a normalization assumption on the kernel function employed in the model
imilS]. Exploiting this equivalence, and adapting the Badoiu-Clarkson algorithm
for computing a MEB to the problem of training non-linear SVMs, an algorithm
called Core Vector Machine (CVM) is obtained, which enjoys remarkable
theoretical properties and competitive performance in practice [41].

First, the number of support vectors of the model obtained by the CVM is
$K_0 + O(1/\varepsilon)$, where $K_0$ is a constant and $\varepsilon$ is the
tolerance parameter of the method. Therefore, the space complexity of the model
is independent of the size and dimensionality of the training set. Second, the
number of iterations of the algorithm before termination is also
$O(1/\varepsilon)$, independent of the size and dimensionality of the training
set. To determine the overall time complexity of this method, we note that
Algorithm 2 requires a search for the point $i^*$ representing the best ascent
direction in the current approximation of the objective function, an operation
that is also performed by the FW and MFW methods. Searching among all of the
$m$ training points requires a number of kernel evaluations of order
$O(q_k^2 + m q_k) = O(m q_k)$, where $q_k$ is the cardinality of
$\mathcal{I}_k$. Since the cardinality of $\mathcal{I}_k$ is bounded as
$O(1/\varepsilon)$ (the worst-case number of iterations), we obtain that the
CVM has an overall time complexity of $O(m/\varepsilon)$, linear in the number
of examples, improving on the super-linear time complexity reported empirically
for popular methods like SMO to train SVMs [2S1 E].

If $m$ is very large, however, the complexity per iteration can still become
prohibitive in practice. A sampling technique, called probabilistic speedup,
was proposed in [36] to overcome this obstacle. This technique was also used to
implement the CVM in [HI |4^, leading to SVM training algorithms with an
overall time complexity which is independent of the number of training
examples. In practice, the index $i^*$ is computed just on a random subset $S'$
of the coordinates, whose cardinality $|S'|$ is a constant much smaller than
the total number of coordinates. The overall complexity is thereby reduced to
order $O(q_k^2 + q_k) = O(q_k^2)$, a major improvement on the previous
estimate, since we generally have $q_k \ll m$. Refer to [35] or [41] for
details about this speed-up technique.
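
The sampling idea can be sketched as below: instead of scanning all $m$ gradient coordinates, the ascent index is searched in a random subset of constant size, and each gradient coordinate is computed on demand from the current (sparse) iterate. Everything here is an illustrative assumption: the objective is taken to be $g(\alpha) = -\alpha^T K \alpha$, `k_entry` is a hypothetical stand-in for the entries $K_{ij}$, and the sample size of 50 is arbitrary.

```python
import random

def make_grad_coordinate(k_entry, alpha_sparse):
    """Return a function computing grad g(alpha)_i = -2 * sum_j K_ij alpha_j.

    Only the non-zero coordinates of alpha (a dict index -> weight) are
    visited, so one call costs as many K_ij evaluations as alpha has non-zeros.
    """
    def grad_coordinate(i):
        return -2.0 * sum(k_entry(i, j) * a for j, a in alpha_sparse.items())
    return grad_coordinate

def sampled_argmax(grad_coordinate, m, sample_size, rng):
    """Approximate i* = argmax_i grad_i by scanning a random subset of range(m)."""
    candidates = rng.sample(range(m), min(sample_size, m))
    return max(candidates, key=grad_coordinate)

def k_entry(i, j):
    # Hypothetical stand-in for K_ij, used only to have numbers to work with.
    return 1.0 / (1.0 + abs(i - j)) + (0.1 if i == j else 0.0)

m = 1000
alpha_sparse = {3: 0.5, 40: 0.3, 700: 0.2}     # current iterate, 3 non-zeros
grad_i = make_grad_coordinate(k_entry, alpha_sparse)
rng = random.Random(0)
i_sampled = sampled_argmax(grad_i, m, sample_size=50, rng=rng)   # 50 evaluations
i_exact = sampled_argmax(grad_i, m, sample_size=m, rng=rng)      # full scan
print(i_sampled, grad_i(i_sampled), i_exact, grad_i(i_exact))
```

The sampled index is in general only an approximation of the true $i^*$; the trade-off is a per-iteration cost that no longer grows with $m$.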

More recently, several authors have explored the adaptation of the original FW
methods to the task of training SVMs. The advantage of Algorithms 1 and 3 over
Algorithm 2 is that they rely only on analytical steps. As a result, each
training iteration becomes significantly cheaper than a CVM iteration and does
not depend on any external numerical solver. In practice, the training
algorithm may require more iterations in order to obtain a solution within the
predefined tolerance criterion $\varepsilon$, but the work per iteration is
significantly smaller. Such a trade-off has been shown to be worthwhile when
dealing with large-scale applications [29l HU [13] .

In [13, 15] the authors show that, by adopting Algorithms 1 and 3, the running
times of [41] can be significantly improved as long as a minor loss in accuracy
is acceptable. From the analysis presented in [S], it is possible to conclude
that this approach enjoys similar theoretical guarantees, namely, linear time
in the number of examples and a number of iterations which is independent of
the number of examples. The sampling technique to speed up the computation of
$i^*$ introduced above can be used with these methods as well, in order to
obtain overall time complexities which are independent of the number of
training patterns.

In a closely related work [26], Kumar and Yildirim present a specialization of
the MFW method to SVM problems, adopting the geometrical formulation studied in
[3]. This approach reformulates the SVM problem as a minimum polytope distance
problem. The obtained method and its properties are also strongly related to
the work of Gartner and Jaggi [19], in which the authors adapt the
computationally intensive variant of the FW method to the minimum polytope
distance problem. In [55], Ouyang et al. propose a stochastic variant of FW
methods for online learning of L2-SVMs, obtaining comparable and sometimes
better accuracies than state-of-the-art batch and online algorithms for
training SVMs. A similar technique has recently been proposed in [SD] to allow
smooth and general online convex optimization with sub-linear regret bounds
[57]. Variants of the method proposed in [H] have been introduced in [33] and
[5T] for training SVMs on data streams. In [57] the authors adapted the FW
method to train SVMs with structured outputs like graphs and other
combinatorial objects [43L [5], obtaining an algorithm which outperforms
competing structural SVM solvers.

3 The SWAP Method 

We have described in the previous sections how the basic FW method can be
modified in order to avoid stagnation near a solution, obtaining in this way an
algorithm with a guaranteed rate of convergence. Our previous remarks about the
MFW method suggest that this algorithm should terminate faster and find sparser
solutions. In practice, however, the MFW method is not always as fast as one
could expect from the theory. For instance, the experimental results reported
in [47] and [1] for the MEB and Minimum Volume Ellipsoid problems respectively
show that only marginal improvements, if any, are obtained using the enhanced
method (MFW) with respect to the basic approach. As concerns the problem of
training SVMs, results in [13] confirm using statistical tests that MFW is not
systematically better than FW. Indeed, it may sometimes be slower. Similarly,
the authors of [28] argue that the use of away steps does not provide a clear
advantage with respect to the standard FW method.

A possible interpretation of these results can be given by looking at the way
in which MFW implements the away steps to keep feasibility, i.e., to ensure
that the constraint $\sum_i \alpha_i = 1$ is satisfied. The basic idea in the
MFW approach is to include the alternative of getting away from a descent
direction, decreasing the weight of the corresponding vertex $j^*$ in the
current solution, instead of getting closer to an ascent direction, which would
increase the weight of the corresponding vertex $i^*$. The choice is mutually
exclusive. If the algorithm decides to work around $j^*$, it may lose the
opportunity to explore a promising direction of the feasible space, and vice
versa.

On the other hand, if an away step is performed, the weights of the active
vertices $i \in \mathcal{I}_k$ are uniformly scaled by $(1 + \lambda)$ to keep
feasibility. This scheme not only perturbs the current approximation
considerably, since all the weights are modified, but, more importantly, can
increase the weights of vertices which do not belong to the optimal face $S^*$.
Away steps in the MFW method are thus prone to increase the need for further
away steps to eliminate those "spurious points" ($i \in \mathcal{I}_k$, but
$e_i \notin S^*$).

Here, we introduce a new type of away step devised to circumvent these 
problems and to preserve the advantages of MFW. We discuss two variants of 
the method, obtained by using first and second order approximations of the 
objective function at each iteration, respectively. 

3.1 Main Construction 

The first variant of our method is obtained as follows. As in the MFW method,
we consider, at each iteration, the maximum ascent direction

$$ i^* = \arg\max_i \nabla g(\alpha_k)_i , \qquad (12) $$

and the maximum descent direction among the indexes spanning the current
solution $\alpha_k$,

$$ j^* = \arg\max_{j \in \mathcal{I}_k} -\nabla g(\alpha_k)_j = \arg\min_{j \in \mathcal{I}_k} \nabla g(\alpha_k)_j . \qquad (13) $$

However, instead of updating the current iterate as
$\alpha_{k+1} = \alpha_k + \lambda (\alpha_k - e_{j^*})$, we propose a step of
the form

$$ \alpha_{k+1} = \alpha_k + \lambda (e_{i^*} - e_{j^*}) , \qquad (14) $$

where $\lambda$ is determined by a line-search, as usual. This scheme provides
the following conceptual advantages.

1. The away step perturbs the current solution $\alpha_k$ only locally, in the
sense that the weight of any vertex other than $e_{i^*}$ and $e_{j^*}$ is
preserved.

2. The away step does not increase the weight of vertices $e_j$ of the active
face corresponding to descent directions. These points may correspond to
spurious points that need to be removed from the active face to reach the
optimal face of the problem.

3. The away step moves the current solution in the away direction and
simultaneously in the direction of a toward step. That is, it moves away
from the descent vertex $e_{j^*}$, but also gets closer to the ascent vertex
$e_{i^*}$ in the same iteration. Up to a rescaling of the step size, the
step (14) can actually be written as the superposition of two separate
steps,

$$ \alpha_{k+1} = \tfrac{1}{2} \underbrace{\big( \alpha_k + \lambda (e_{i^*} - \alpha_k) \big)}_{\text{toward step}} + \tfrac{1}{2} \underbrace{\big( \alpha_k + \lambda (\alpha_k - e_{j^*}) \big)}_{\text{away step}} , \qquad (15) $$

where the first term of the right-hand side,
$\alpha_k + \lambda (e_{i^*} - \alpha_k)$, represents the standard toward
step in the FW method and the second term,
$\alpha_k + \lambda (\alpha_k - e_{j^*})$, the away step considered in the
MFW approach. Note that the term $\lambda \alpha_k$ disappears in the sum,
which equals $\alpha_k + \tfrac{\lambda}{2} (e_{i^*} - e_{j^*})$, so that
only the components corresponding to $i^*$ and $j^*$ are updated, leaving
the rest of the current solution unchanged.
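
The three properties listed above can be verified on a small numerical example. The sketch below uses hypothetical helper names and an arbitrary iterate, step size and pair of indexes; it compares the toward, away and SWAP updates and checks that averaging a toward and an away step of size $\lambda$ gives a SWAP step of size $\lambda/2$.

```python
def toward_step(alpha, i, lam):
    """alpha + lam * (e_i - alpha): every weight is scaled by (1 - lam)."""
    out = [(1.0 - lam) * a for a in alpha]
    out[i] += lam
    return out

def away_step(alpha, j, lam):
    """alpha + lam * (alpha - e_j): every weight is scaled by (1 + lam)."""
    out = [(1.0 + lam) * a for a in alpha]
    out[j] -= lam
    return out

def swap_step(alpha, i, j, lam):
    """alpha + lam * (e_i - e_j): only coordinates i and j change."""
    out = list(alpha)
    out[i] += lam
    out[j] -= lam
    return out

alpha = [0.5, 0.3, 0.2, 0.0]
i_star, j_star, lam = 3, 1, 0.1

toward = toward_step(alpha, i_star, lam)      # approx. [0.45, 0.27, 0.18, 0.10]
away = away_step(alpha, j_star, lam)          # approx. [0.55, 0.23, 0.22, 0.00]
swap = swap_step(alpha, i_star, j_star, lam)  # approx. [0.50, 0.20, 0.20, 0.10]
average = [0.5 * (t + a) for t, a in zip(toward, away)]
half_swap = swap_step(alpha, i_star, j_star, lam / 2.0)
print(toward, away, swap, average, half_swap)
```

In the away step the weights of vertices 0 and 2 grow even though neither is involved in the decision, whereas the SWAP step leaves them untouched.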

The new type of away step is called a SWAP step and substitutes the MFW away
steps in Algorithm 3. The procedure is summarized in Algorithm 4. Note that we
deliberately include some steps which do not represent computational tasks but
definitions which simplify the convergence analysis of the next section.

To choose the type of step to perform, the MFW criterion cannot be employed in
our method. The MFW method employs a first order approximation of $g(\cdot)$ at
the current iterate to predict the value of the objective function at the next
iterate. That is, if $d$ denotes the search direction,

$$ \psi_k(\alpha_k + \lambda d) = g(\alpha_k) + \lambda d^T \nabla g(\alpha_k) \qquad (16) $$

is computed. The step which gives the largest value of $\psi_k$ is selected.
However, a SWAP step never gives a smaller value of $\psi_k$ than the value
obtained using a toward step. Indeed, the value of $\psi_k$ using a SWAP step
is

$$ \psi_k(\alpha_k + \lambda d_{swap}) = g(\alpha_k) + \lambda (e_{i^*} - e_{j^*})^T \nabla g(\alpha_k) = g(\alpha_k) + \lambda \nabla g(\alpha_k)_{i^*} - \lambda \nabla g(\alpha_k)_{j^*} . \qquad (17) $$

The value of $\psi_k$ using a toward step is

$$ \psi_k(\alpha_k + \lambda d_{fw}) = g(\alpha_k) + \lambda (e_{i^*} - \alpha_k)^T \nabla g(\alpha_k) = g(\alpha_k) + \lambda \nabla g(\alpha_k)_{i^*} - \lambda \alpha_k^T \nabla g(\alpha_k) . \qquad (18) $$

Since $\alpha_k^T \nabla g(\alpha_k)$ is a convex combination of the active
coordinates of the gradient, it is never smaller than
$\nabla g(\alpha_k)_{j^*}$, and a SWAP step would always be preferred using
first-order information to predict the objective function value.
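
The inequality behind this observation can be spelled out in one line. Using only the definitions of $j^*$ and of the simplex (the weights $\alpha_{k,j}$ are non-negative, vanish outside $\mathcal{I}_k$ and sum to one), one has

$$ \alpha_k^T \nabla g(\alpha_k) = \sum_{j \in \mathcal{I}_k} \alpha_{k,j} \nabla g(\alpha_k)_j \ \ge \ \Big( \sum_{j \in \mathcal{I}_k} \alpha_{k,j} \Big) \min_{j \in \mathcal{I}_k} \nabla g(\alpha_k)_j = \nabla g(\alpha_k)_{j^*} , $$

so that, subtracting (18) from (17),

$$ \psi_k(\alpha_k + \lambda d_{swap}) - \psi_k(\alpha_k + \lambda d_{fw}) = \lambda \big( \alpha_k^T \nabla g(\alpha_k) - \nabla g(\alpha_k)_{j^*} \big) \ \ge \ 0 \quad \text{for every } \lambda \ge 0 . $$

Equality holds only when all active coordinates of the gradient coincide, in which case the iterate is already optimal on its active face.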

To address this problem we observe that the MFW method computes an exact
line-search for the search direction selected using $\psi_k$. We thus formulate
our method computing the line-search before deciding the type of step to
perform. This design requires performing two line-searches instead of one.
However, the estimation of the objective function value at the next iterate is
more accurate.

As we will discuss in the section regarding the adaptation of the procedure 
to the SVM problem, this computation is particularly simple for the objective 
Algorithm 4: The SWAP Algorithm.

 1  Set $k = 0$.
 2  Compute an initial estimate $\alpha_0$.
 3  Set $\mathcal{I}_0 = \{i : \alpha_{0,i} \neq 0\}$.
 4  for $k = 0, 1, \ldots$ do
 5      Search $i^* = \arg\max_i \nabla g(\alpha_k)_i$ (ascent direction).
 6      Search $j^* = \arg\min_{j : \alpha_{k,j} \neq 0} \nabla g(\alpha_k)_j$ (descent direction).
 7      Line-search $\lambda_{swap} = \arg\max_{\lambda \in [0,1]} g(\alpha_k + \lambda(e_{i^*} - e_{j^*}))$.
 8      Line-search $\lambda_{fw} = \arg\max_{\lambda \in [0,1]} g(\alpha_k + \lambda(e_{i^*} - \alpha_k))$.
 9      Compute $\delta_{swap} = g(\alpha_k + \lambda_{swap}(e_{i^*} - e_{j^*})) - g(\alpha_k)$ (improvement of a SWAP step).
10      Compute $\delta_{fw} = g(\alpha_k + \lambda_{fw}(e_{i^*} - \alpha_k)) - g(\alpha_k)$ (improvement of a toward step).
11      Compute $\delta_k = \max(\delta_{swap}, \delta_{fw})$ (the best improvement).
12      if $\delta_k = \delta_{swap}$ then
13          Clip the line-search parameter, $\lambda_{swap}^* = \min(\lambda_{swap}, \alpha_{k,j^*})$.
14          If $\lambda_{swap}^* = \alpha_{k,j^*}$, mark the iteration as a SWAP-drop step.
15          If $\lambda_{swap}^* = \lambda_{swap}$, mark the iteration as a SWAP-add step.
16          Perform the SWAP step, $\alpha_{k+1} = \alpha_k + \lambda_{swap}^*(e_{i^*} - e_{j^*})$.
17          Update $\mathcal{I}_k$ by $\mathcal{I}_{k+1} = \mathcal{I}_k \cup \{i^*\}$.
18          If a SWAP-drop step was done, $\mathcal{I}_{k+1} = \mathcal{I}_{k+1} \setminus \{j^*\}$.
19      else
20          Mark the iteration as a FW step.
21          Perform the FW step, $\alpha_{k+1} = \alpha_k + \lambda_{fw}(e_{i^*} - \alpha_k)$.
22          Update $\mathcal{I}_k$ by $\mathcal{I}_{k+1} = \mathcal{I}_k \cup \{i^*\}$.

function in problem (2). All the computations are analytical. Furthermore,
the exact computation of $\delta_{fw}$ and $\delta_{swap}$ involves terms already computed in the
line-searches and therefore does not represent an additional overhead for the
algorithm.

3.2 A Second-order Variant 

All the FW methods introduced previously make use of first-order approxima- 
tions of the objective function in order to determine the direction toward which 
the current iterate should be moved. Here, we consider the possibility of using 
a second-order approximation. If we assume that the objective function is twice 
differentiable, the second-order Taylor approximation of g(-) in a neighborhood 
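of $\alpha_k$ is

\[ g(\alpha_k + \lambda d) \approx g(\alpha_k) + \lambda \nabla g(\alpha_k)^T d + \tfrac{1}{2}\lambda^2 d^T \nabla^2 g(\alpha_k)\, d , \tag{20} \]

where the Hessian matrix $\nabla^2 g(\alpha_k)$ is negative semi-definite. Determining the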
best ascent direction would thus imply the computation of the quadratic form 
Algorithm 5: The SWAP-2o Algorithm.

Proceed as in Algorithm 4 but modify step 6 as follows:

\[
j^* = \arg\max_{j \in \mathcal{I}_k}
\frac{\left(\nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)_j\right)^2}
     {-2\left(\nabla^2_{i^*,i^*} - 2\nabla^2_{i^*,j} + \nabla^2_{j,j}\right)} .
\]

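$d^T \nabla^2 g(\alpha_k)\, d$. Since the matrix $\nabla^2 g(\alpha_k)$ may be highly dense, which is usually
the case in SVM applications, employing a first-order relaxation as in
Frank-Wolfe methods makes sense in order to obtain lighter iterations. However, we
note that the search direction for a SWAP step, $d_{swap} = e_{i^*} - e_{j^*}$, yields a
particularly simple expression

\[
\begin{aligned}
g(\alpha_k + \lambda d_{swap}) &\approx g(\alpha_k) + \lambda \nabla g(\alpha_k)^T (e_{i^*} - e_{j^*})
  + \tfrac{1}{2}\lambda^2 (e_{i^*} - e_{j^*})^T \nabla^2 g(\alpha_k) (e_{i^*} - e_{j^*}) \\
 &= g(\alpha_k) + \lambda \left(\nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)_{j^*}\right)
  + \tfrac{1}{2}\lambda^2 \left(\nabla^2_{i^*,i^*} - 2\nabla^2_{i^*,j^*} + \nabla^2_{j^*,j^*}\right) ,
\end{aligned} \tag{21}
\]

where $\nabla^2_{i,j} = \nabla^2 g(\alpha_k)_{i,j}$.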

In order to determine the best pair $e_{i^*}, e_{j^*}$ we thus need to evaluate three
entries of the Hessian matrix. However, this is still a computationally hard task
for each iteration, since we would need to consider $m\,|\mathcal{I}_k|$ pairs of points in order
to take a step. We thus adopt the strategy used in the second-order version
of SMO proposed in [TT]. We fix the best ascent index i* just as in the first- 
order SWAP and search for the index j* in the active set which maximizes the 



improvement of the second-order approximation (21). We call the obtained
procedure second-order SWAP and we denote it as SWAP-2o in the next sections.
It is worth noting that this approximation is exact for quadratic objective
functions, which is the case for the SVM problem (2). Note also that in this
case the line-search along the ascent direction $d_{swap}$ defined by $i^*$ and $j^*$ has a
closed-form solution. Indeed,

\[
\lambda^* = \frac{\nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)_{j^*}}
                {-\left(\nabla^2_{i^*,i^*} - 2\nabla^2_{i^*,j^*} + \nabla^2_{j^*,j^*}\right)} . \tag{22}
\]

Since $i^*$ maximizes the gradient, the numerator is non-negative, and from the
negative semi-definiteness of $\nabla^2 g(\alpha_k)$ it follows that $\lambda^*$ is non-negative.

Substituting this step-size in (21), the improvement in the objective function
becomes

\[
g(\alpha_k + \lambda^* d_{swap}) - g(\alpha_k) =
\frac{\left(\nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)_{j^*}\right)^2}
     {-2\left(\nabla^2_{i^*,i^*} - 2\nabla^2_{i^*,j^*} + \nabla^2_{j^*,j^*}\right)} , \tag{23}
\]

which again, from the negative semi-definiteness of $\nabla^2 g(\alpha_k)$, is non-negative.
Naturally, we need to restrict the value of $\lambda^*$ to the interval $[0, 1]$ in order to
obtain a feasible solution for the next step. We thus modify Algorithm 4 as
specified in Algorithm 5.
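
As an illustration of the second-order selection rule, the sketch below scans the active set for the index maximizing the gain in (23). It is only a hedged sketch: the function name `second_order_descent_index`, the choice to skip directions with zero curvature, and the small example matrix are assumptions made here, and the reported gain is that of the unclipped step.

```python
def second_order_descent_index(grad, hess, active, i_star):
    """Return (j_star, step, gain) maximizing the second-order gain of Eq. (23)
    over the active set, or None if no admissible index exists."""
    best = None
    for j in active:
        if j == i_star:
            continue
        slope = grad[i_star] - grad[j]
        curvature = hess[i_star][i_star] - 2.0 * hess[i_star][j] + hess[j][j]
        if curvature >= 0.0:
            # Flat direction: Eq. (23) is undefined, so it is skipped (assumption).
            continue
        gain = slope * slope / (-2.0 * curvature)          # Eq. (23)
        step = min(1.0, max(0.0, slope / -curvature))      # Eq. (22), restricted to [0, 1]
        if best is None or gain > best[2]:
            best = (j, step, gain)
    return best


# Hypothetical 3x3 kernel matrix; Hessian of g(a) = -a^T K a is -2K.
K = [[1.0, 0.2, 0.0],
     [0.2, 1.0, 0.1],
     [0.0, 0.1, 1.0]]
hess = [[-2.0 * K[r][c] for c in range(3)] for r in range(3)]
grad = [-1.2, -1.2, -0.1]          # equals -2 K alpha for alpha = (0.5, 0.5, 0)
j_star, step, gain = second_order_descent_index(grad, hess, active=[0, 1], i_star=2)
print(j_star, step, gain)
```

In this example the two active points have the same gradient value, so a first-order rule cannot distinguish them, while the second-order rule prefers the one whose curvature term is smaller.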
3.3 Notes on the Adaptation to SVM Training

Here we provide analytical expressions for all the computations required by
Algorithm 4 and Algorithm 5 applied to the SVM problem (2). Similar
expressions follow for any quadratic objective function.

For problem (2), the gradient and Hessian at a given iterate $\alpha_k$ take
particularly simple expressions:

\[ \nabla g(\alpha_k) = -2K\alpha_k , \qquad \nabla^2 g(\alpha_k) = -2K . \tag{24} \]

Notice that $\alpha_k^T \nabla g(\alpha_k) = 2g(\alpha_k)$. Therefore, the line-searches in Algorithm
4 or Algorithm 5 can be performed analytically as follows. For FW steps,

\[
\lambda_{fw} = \frac{\nabla g(\alpha_k)_{i^*} - 2g(\alpha_k)}
                   {2\left(K_{i^*,i^*} + \nabla g(\alpha_k)_{i^*} - g(\alpha_k)\right)} . \tag{25}
\]

Note that the quantity $\nabla g(\alpha_k)_{i^*}$ has already been computed to choose the
ascent direction. For SWAP steps,

\[
\lambda_{swap} = \frac{\nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)_{j^*}}
                     {2\left(K_{i^*,i^*} - 2K_{i^*,j^*} + K_{j^*,j^*}\right)} . \tag{26}
\]

The quantity $\nabla g(\alpha_k)_{j^*}$ has already been computed to choose the descent
direction. The improvement in the objective function can also be calculated
analytically. For FW steps,

\[
\delta_{fw} = \frac{\left(\nabla g(\alpha_k)_{i^*} - 2g(\alpha_k)\right)^2}
                  {4\left(K_{i^*,i^*} + \nabla g(\alpha_k)_{i^*} - g(\alpha_k)\right)} . \tag{27}
\]

All the terms involved here have already been computed to perform the
line-search. Similarly, for SWAP steps,

\[
\delta_{swap} = \frac{\left(\nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)_{j^*}\right)^2}
                    {4\left(K_{i^*,i^*} - 2K_{i^*,j^*} + K_{j^*,j^*}\right)} . \tag{28}
\]

With the exception of the term $K_{i^*,j^*}$, all the computations have already
been performed to compute $\delta_{fw}$ and to search for the descent direction. We
conclude that, compared with the MFW procedure, the SWAP method adapted for
problem (2) involves the computation of just one additional term, which is an
entry of the kernel matrix $K$ defining the SVM problem.

The objective function value $g(\alpha_k)$ can be computed recursively from the
relationship* $g(\alpha_{k+1}) = g(\alpha_k) + \delta_k$. Finally, we observe that the stopping
criterion of Eqn. (8) takes the form

\[ \Delta^d(\alpha_k) = \nabla g(\alpha_k)_{i^*} - 2g(\alpha_k) \le \varepsilon , \tag{29} \]

which involves the same already computed terms.

* A similar recursive equation can be derived to handle the case of SWAP-drop steps.
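
The closed-form expressions (24) to (29) can be put together in a compact sketch of Algorithm 4 for a quadratic objective. This is an illustrative sketch only, assuming $g(\alpha) = -\alpha^T K \alpha$ with a symmetric positive semi-definite $K$ stored as a list of lists; the names `swap_solve`, `line_search` and `objective` are hypothetical, the full gradient is recomputed at every iteration for clarity (a practical implementation would update it incrementally and cache kernel entries), and the objective value is recomputed rather than updated recursively.

```python
def objective(K, alpha):
    """g(alpha) = -alpha^T K alpha."""
    m = len(alpha)
    return -sum(alpha[i] * K[i][j] * alpha[j] for i in range(m) for j in range(m))


def line_search(slope, curvature):
    """Maximize lam*slope - 0.5*lam^2*curvature over [0, 1], assuming slope >= 0.
    Returns (lam, gain); matches Eqs. (25)-(28) when the maximizer is interior."""
    lam = 1.0 if curvature <= 0.0 else min(1.0, max(0.0, slope / curvature))
    return lam, lam * slope - 0.5 * lam * lam * curvature


def swap_solve(K, alpha0, eps=1e-6, max_iter=1000):
    m = len(alpha0)
    alpha = list(alpha0)
    for _ in range(max_iter):
        grad = [-2.0 * sum(K[i][j] * alpha[j] for j in range(m)) for i in range(m)]  # Eq. (24)
        g = 0.5 * sum(a * gi for a, gi in zip(alpha, grad))   # alpha^T grad = 2 g
        i_s = max(range(m), key=lambda i: grad[i])            # ascent index
        if grad[i_s] - 2.0 * g <= eps:                        # stopping rule, Eq. (29)
            break
        j_s = min((j for j in range(m) if alpha[j] > 0.0), key=lambda j: grad[j])
        # Curvatures 2 d^T K d along the toward and SWAP directions.
        curv_fw = 2.0 * (K[i_s][i_s] + grad[i_s] - g)
        curv_sw = 2.0 * (K[i_s][i_s] - 2.0 * K[i_s][j_s] + K[j_s][j_s])
        lam_fw, gain_fw = line_search(grad[i_s] - 2.0 * g, curv_fw)
        lam_sw, gain_sw = line_search(grad[i_s] - grad[j_s], curv_sw)
        if gain_sw >= gain_fw:
            lam = min(lam_sw, alpha[j_s])                     # clip to stay feasible
            dropped = lam == alpha[j_s]                       # SWAP-drop step
            alpha[i_s] += lam
            alpha[j_s] = 0.0 if dropped else alpha[j_s] - lam
        else:
            alpha = [(1.0 - lam_fw) * a for a in alpha]       # FW (toward) step
            alpha[i_s] += lam_fw
    return alpha, objective(K, alpha)


# Toy problem: K = diag(2, 1, 1); the maximizer on the simplex is (0.2, 0.4, 0.4).
K_toy = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
alpha_opt, g_opt = swap_solve(K_toy, [1.0, 0.0, 0.0])
print(alpha_opt, g_opt)
```

On this toy problem the first iteration is a SWAP step and the second a FW step, after which the stopping rule (29) is met.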
4 Convergence Analysis of the SWAP Method

In this section we study the convergence of the SWAP method on problem (1),
of which the L2-SVM problem (2) is a particular instance.

We start by demonstrating the global convergence of the SWAP method.
Then, we analyze its rate of convergence towards the optimum. For this
purpose we will adapt the analysis presented in [1]. Using this framework and
a set of observations concerning the improvement in the objective function
after an iteration of the SWAP method, we will be able to prove that the
algorithm converges linearly to the optimal value of the objective function. From
a theoretical point of view these results show that the SWAP method enjoys the
same mathematical properties as the MFW method. Finally, we provide bounds on
the number of iterations required to fulfill the stopping condition of Eqn. (8).
We demonstrate that the algorithm stops in at most $O(1/\varepsilon)$ iterations
independently of the number of variables $m$, which coincides with the number of
training examples in the SVM problem.

Here we only provide proofs for the first-order SWAP method, described in
Algorithm 4. However, all the convergence results follow easily for the
second-order variant as well. The statements and proofs of the technical results used
in this section can all be found in the Appendix.

We develop our analysis under the following assumptions: 

B1. $g$ is twice continuously differentiable;

B2. There is an optimal solution $\alpha^*$ of the optimization problem satisfying the
strong sufficient condition of Robinson in [32].

The above hypotheses are just as strong as those imposed in [17] and [1] to
study the convergence of Frank-Wolfe methods for the MEB problem and the
minimum volume enclosing ellipsoid problem.

In [18], convergence properties of FW and MFW methods were analyzed 
under the following alternative hypotheses: 

A1. $\nabla g$ is Lipschitz-continuous on the feasible set;

A2. $g(\alpha)$ is strongly concave;

A3. Let $\alpha^*$ be optimal for (1) and $\mathcal{F}^*$ be the smallest face of the feasible set
containing $\alpha^*$. Then

\[ (\alpha - \alpha^*)^T \nabla g(\alpha^*) = 0 \iff \alpha \in \mathcal{F}^* \quad \text{(strict complementarity)}. \]

However, this set of assumptions can be difficult to satisfy in practice. In
particular, A3 is quite a strong assumption and cannot be guaranteed in general.

Note that assumption B1 implies A1, since the feasible set is compact. In
addition, B1 holds most of the time in machine learning problems. It can also be
shown that if problem (1) is strongly concave, the strong sufficient condition of
Robinson holds, i.e. A2 implies B2 [1]. In particular, this is satisfied by the
Wolfe dual of the L2-SVM problem.
4.1 Global Convergence

Proposition 1. (Global Convergence) Starting from any feasible $\alpha_0$, Algorithm
4 produces a sequence of iterates $\{\alpha_k\}_k$ such that $g(\alpha_k)$ converges to $g(\alpha^*)$, where
$\alpha^*$ is a solution of problem (1). If $\alpha^*$ is unique, $\{\alpha_k\}_k$ converges to $\alpha^*$.

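Proof. There are two critical observations to be made. First, as mentioned
above, A1 (assumption 1 in [18]) is fulfilled for a twice continuously
differentiable concave function. In fact, from the mean value theorem we have that

\[ \nabla g(x) - \nabla g(y) = \nabla^2 g(z)\,(x - y) , \tag{30} \]

with $z$ on the line between $x$ and $y$. Let $L$ be the largest eigenvalue (in modulus)
of the matrix $\nabla^2 g(z)$ for $z$ in the unit simplex. Then,

\[ \|\nabla g(x) - \nabla g(y)\| \le L \|x - y\| \tag{31} \]

for any $x, y$ in the unit simplex. Thus, $\nabla g(\alpha)$ is Lipschitz-continuous on the
simplex. The second observation is that both FW and SWAP search directions
$d_k$ in Algorithm 4 satisfy

\[ g(\alpha^*) - g(\alpha_k) \le d_k^T \nabla g(\alpha_k) , \tag{32} \]

where $d_k = e_{i^*} - \alpha_k$ for FW steps and $d_k = e_{i^*} - e_{j^*}$ for SWAP steps. In
the case of FW steps, the result was stated in [18]. However, it is not hard to
see that

\[ (e_{i^*} - e_{j^*})^T \nabla g(\alpha_k) \ge (e_{i^*} - \alpha_k)^T \nabla g(\alpha_k) , \tag{33} \]

and thus Eqn. (32) also holds for SWAP steps. We now follow the proof of
Theorem 1 in [18]. After a step of Algorithm 4 with some step-size $\lambda \in [0, 1]$,
the new iterate $\alpha_{k+1}$ is computed as

\[ \alpha_{k+1} = \alpha_k + \lambda d_k , \tag{34} \]

which includes the case of SWAP-drop steps. By the mean value theorem,

\[ g(\alpha_{k+1}) - g(\alpha_k) = \lambda\, d_k^T \nabla g(\alpha_k + \bar{\lambda}\lambda d_k) , \tag{35} \]

for some $\bar{\lambda} \in [0, 1]$. Summing and subtracting $\lambda\, d_k^T \nabla g(\alpha_k)$ leads to

\[ g(\alpha_{k+1}) - g(\alpha_k) = \lambda\, d_k^T \nabla g(\alpha_k) + \lambda\, d_k^T \left(\nabla g(\alpha_k + \bar{\lambda}\lambda d_k) - \nabla g(\alpha_k)\right) . \tag{36} \]

By using Eqn. (32),

\[
g(\alpha_{k+1}) - g(\alpha_k) \ge \frac{\lambda}{2}\, d_k^T \nabla g(\alpha_k) + \frac{\lambda}{2}\left(g(\alpha^*) - g(\alpha_k)\right)
 + \lambda\, d_k^T \left(\nabla g(\alpha_k + \bar{\lambda}\lambda d_k) - \nabla g(\alpha_k)\right) . \tag{37}
\]

From Eqn. (31) and the fact that $\bar{\lambda} \in [0, 1]$, we obtain

\[
\begin{aligned}
g(\alpha_{k+1}) - g(\alpha_k) &\ge \frac{\lambda}{2}\, d_k^T \nabla g(\alpha_k) + \frac{\lambda}{2}\left(g(\alpha^*) - g(\alpha_k)\right) - L\lambda^2 \|d_k\|^2 \\
 &\ge \frac{\lambda}{2}\, d_k^T \nabla g(\alpha_k) + \frac{\lambda}{2}\left(g(\alpha^*) - g(\alpha_k)\right) - L\lambda^2 D^2 ,
\end{aligned} \tag{38}
\]

where $D$ is the diameter of the simplex. Now, the term $\frac{\lambda}{2}\, d_k^T \nabla g(\alpha_k)$ in (38)
is strictly positive because $d_k$ is an ascent direction. The other part of the
right-hand side in (38) is a second-order polynomial in $\lambda$,

\[ \frac{\lambda}{2}\left(g(\alpha^*) - g(\alpha_k)\right) - L\lambda^2 D^2 , \tag{39} \]

with two roots,

\[ \lambda_1 = 0 \quad \text{and} \quad \lambda_2 = \frac{g(\alpha^*) - g(\alpha_k)}{2LD^2} . \tag{40} \]

It is easy to see that this polynomial is non-negative if and only if
$\lambda \in [\lambda_1, \lambda_2] = [0, \lambda_2] \neq \emptyset$. Note now that Eqn. (38) holds for any $\lambda \in [0, 1]$.
Algorithm 4 searches in such interval for the value of $\lambda$ which maximizes the
improvement in the objective function, so the improvement it obtains is at least
the one guaranteed by (38) for any admissible $\lambda$ in $[0, \lambda_2]$. In particular,

\[ \lambda \le \frac{g(\alpha^*) - g(\alpha_k)}{2LD^2} \implies \frac{\lambda}{2}\left(g(\alpha^*) - g(\alpha_k)\right) - L\lambda^2 D^2 \ge 0 . \tag{41} \]

Thus, if we denote by $\alpha_{k+1}$ the iterate built by Algorithm 4 after the
line-search,

\[ g(\alpha_{k+1}) - g(\alpha_k) \ge \frac{m_k}{2}\, d_k^T \nabla g(\alpha_k) , \tag{42} \]

with

\[ m_k = \min\left(1, \frac{g(\alpha^*) - g(\alpha_k)}{2LD^2}\right) . \tag{43} \]

Note that Eqn. (42) continues to hold for SWAP-drop steps because clipping
the step can only make $\lambda$ even smaller. Now, using Eqn. (32) again,

\[ g(\alpha_{k+1}) - g(\alpha_k) \ge \frac{m_k}{2}\left(g(\alpha^*) - g(\alpha_k)\right) . \tag{44} \]

Rearranging the terms,

\[ g(\alpha^*) - g(\alpha_{k+1}) \le \left(1 - \frac{m_k}{2}\right)\left(g(\alpha^*) - g(\alpha_k)\right) . \tag{45} \]

Since the sequences $\{g(\alpha_k)\}_k$ and $\{m_k\}_k$ are monotonic and bounded, they
admit limits $g^\infty$ and $m^\infty$ respectively. By taking limits on both sides of (45),
we have

\[ g(\alpha^*) - g^\infty \le \left(1 - \frac{m^\infty}{2}\right)\left(g(\alpha^*) - g^\infty\right) . \tag{46} \]

This implies that either $g^\infty = g(\alpha^*)$ or $m^\infty = 0$, which both imply convergence
of $\{g(\alpha_k)\}_k$ to $g(\alpha^*)$. □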

4.2 Analysis of the Rate of Convergence 

We now prove a linear convergence result for the SWAP algorithm. In the proof,
we make use of the following technical lemma.

Lemma 1. After any iteration marked as SWAP-add or FW in Algorithm 4,
the iterate $\alpha_k$ is a $\Delta$-approximate solution with $\Delta = \max\left(2\sqrt{L\delta_k},\, 2\delta_k\right)$.

Proof. It follows immediately from reordering Eqns. (84) and (97) in the
Appendix. □

Note that this result holds for the SWAP algorithm and not for the FW
method, since Eqn. (97) requires the SWAP steps.

Note also that for convergence analysis purposes we can assume that $\delta_k < L$
for $k$ sufficiently large. This follows from the fact that Algorithm 4 converges
globally and that an iterate $\alpha_k$ generated by the algorithm is always feasible.
From the first fact it follows that $g(\alpha^*) - g(\alpha_k)$ becomes arbitrarily small for a
sufficiently large $k$. From the second fact it follows that $g(\alpha^*) \ge g(\alpha_k)$. Since
$\delta_k$ is the improvement in the objective function at each iteration of Algorithm
4, this quantity will be, from some iteration onwards, lower than any predefined
constant, in particular $L$. Note now that if $\delta_k < L$, then

\[ 2\delta_k < 2\sqrt{L\delta_k} . \tag{47} \]

Thus, Lemma 1 states that for sufficiently large $k$ the iterate $\alpha_k$ produced by a
SWAP-add or FW step is a $\Delta$-approximate solution with $\Delta = 2\sqrt{L\delta_k}$.

Proposition 2. Let $\alpha^*$ be the solution of problem (1). Then, for sufficiently
large $k$, any iteration marked as SWAP-add or FW in Algorithm 4 produces an
iterate $\alpha_{k+1}$ satisfying the following inequality

\[ \frac{g(\alpha^*) - g(\alpha_{k+1})}{g(\alpha^*) - g(\alpha_k)} \le 1 - \frac{1}{M} , \tag{48} \]

for some constant $M > 1$.

Proof. Lemma 1 shows that for sufficiently large $k$ the iterate $\alpha_k$ produced
by Algorithm 4 after a SWAP-add or FW step is a $\Delta$-approximate solution,
with $\Delta = 2\sqrt{L\delta_k}$. In addition, since the SWAP method is globally convergent,
$\delta_k$ can be assumed arbitrarily small. Thus, for $k$ large enough, the conditions of
the Appendix lemma leading to Eqn. (71) hold with $\Delta = 2\sqrt{L\delta_k}$. From Eqn. (71),
we then have that there exists a constant $N$ such that

\[ g(\alpha^*) - g(\alpha_k) \le N m \left(2\sqrt{L\delta_k}\right)^2 = 4NmL\,\delta_k , \tag{49} \]

where $m \ge 1$ is the dimensionality of $\alpha$, $N$ is a Lipschitz constant depending
on the problem, and $L$ is the largest eigenvalue (in modulus) of the Hessian
matrix of $g(\alpha)$ on the simplex. Now, for a SWAP-add or a FW step we have,
by definition of $\delta_k$,

\[ g(\alpha_{k+1}) - g(\alpha_k) = \delta_k . \tag{50} \]

Note that the latter is not true for SWAP-drop steps because the real improvement
in the objective function differs from the value computed to decide the type of
step to perform. Thus,

\[ g(\alpha^*) - g(\alpha_k) \le M \left(g(\alpha_{k+1}) - g(\alpha_k)\right) , \tag{51} \]

with $M = 4NmL$. Adding and subtracting $M g(\alpha^*)$ to the right-hand side
produces

\[ g(\alpha^*) - g(\alpha_k) \le M \left(g(\alpha^*) - g(\alpha_k)\right) - M \left(g(\alpha^*) - g(\alpha_{k+1})\right) . \tag{52} \]

Equivalently,

\[
\begin{aligned}
M \left(g(\alpha^*) - g(\alpha_{k+1})\right) &\le M \left(g(\alpha^*) - g(\alpha_k)\right) - \left(g(\alpha^*) - g(\alpha_k)\right) , \\
g(\alpha^*) - g(\alpha_{k+1}) &\le \left(1 - \frac{1}{M}\right)\left(g(\alpha^*) - g(\alpha_k)\right) .
\end{aligned} \tag{53}
\]

Thus,

\[ \frac{g(\alpha^*) - g(\alpha_{k+1})}{g(\alpha^*) - g(\alpha_k)} \le 1 - \frac{1}{M} . \tag{54} \]

□

This result is analogous to the linear convergence theorems obtained in [1]
and gZ] for the MFW algorithm.

Proposition 3. At any iteration of Algorithm 4, the number of SWAP-drop
steps does not exceed half of the total number of steps $T$ made by the algorithm,
plus a finite constant.

Proof. Let $F$ be the number of FW steps, $S$ the number of SWAP-add steps,
$C$ the number of SWAP-drop steps and $A$ the number of steps that add
points to the coreset $\mathcal{I}_k$. We have $A \le F + S$, because only FW steps and
SWAP-add steps can add points to the coreset. Sometimes they include new points,
sometimes they do not. Clearly $T = F + S + C$. Thus, from the previous
inequality we have $T \ge A + C$. Now, the number of steps $C$ that drop points
from the coreset cannot be greater than the number of steps that add points to
the coreset plus the number of points $I$ in the coreset just after
initialization, that is, $A + I \ge C$. Adding the last two inequalities
leads to $T + A + I \ge A + 2C$, that is, $T + I \ge 2C$. Therefore $C \le T/2 + I/2$, which
concludes the proof. □

Proposition 2 states that there exists a subsequence of the iterates $\{\alpha_k\}_k$
produced by Algorithm 4 such that $\{g(\alpha_k)\}_k$ converges linearly to the optimal value
$g(\alpha^*)$ of the objective function in problem (1). This subsequence is obtained by
dropping from $\{\alpha_k\}_k$ the iterates corresponding to SWAP-drop steps, for which
we can only say that the objective function value does not decrease. Thanks
to Proposition 3 we know that these steps do not affect the overall complexity
bound on the number of iterations needed to achieve a given accuracy.

4.3 Iteration Complexity Bounds 

We start by proving the following lemma. 
Lemma 2. By using the stopping condition of Eqn. (8), any iteration marked
as SWAP-add or FW in Algorithm 4 produces an improvement in the objective
function

\[ \delta_k \ge \min\left(\frac{\varepsilon^2}{4L}, \frac{\varepsilon}{2}\right) , \tag{55} \]

where $\varepsilon$ is the tolerance parameter.

Proof. If the algorithm enters the loop after checking the stopping condition of
Eqn. (8),

\[ \nabla g(\alpha_k)_{i^*} - \alpha_k^T \nabla g(\alpha_k) > \varepsilon . \tag{56} \]

From Eqn. (83) we obtain

\[ \max\left(2\sqrt{L\delta_k},\, 2\delta_k\right) \ge \varepsilon , \tag{57} \]

which leads to the result*. □

Note that the converse is not true. The algorithm can stop even if the
improvement in the objective function in the last iteration was greater than
$\min\left(\frac{\varepsilon^2}{4L}, \frac{\varepsilon}{2}\right)$. This happens because the proposed termination criterion
fundamentally looks at the possible improvement with standard FW steps.

Proposition 4. Let $K$ be the number of iterations performed by Algorithm 4
until the stopping condition of Eqn. (8) is fulfilled. Then,

\[ K \le Q + \frac{M}{\varepsilon} , \tag{58} \]

where $Q$, $M$ are constants independent of $m$ and $\varepsilon$.

Proof. Let $k(\delta)$ denote the number of iterations of Algorithm 4 from the first
iterate such that the primal-dual gap satisfies $\Delta^d \le \delta$ until the first that satisfies
$\Delta^d \le \delta/2$. Since the total improvement in the objective function cannot be
greater than $\delta$ and the improvement in the objective function given by a
SWAP-add or a FW step is at least that of Lemma 2 with $\varepsilon = \delta/2$, we can bound $k(\delta)$
as follows†

\[ k(\delta) \le 2\,\frac{\delta}{(\delta/2)^2 / (4L)} = 2\,\frac{16L}{\delta} = \frac{32L}{\delta} , \tag{59} \]

where the multiplying factor 2 comes from the fact that the total number of
iterations is at most two times the number of SWAP-add and FW iterations
plus a finite constant (see the discussion in the proof of Proposition 3). Now, let
$K(\varepsilon)$ be the number of iterations from the first iterate such that the primal-dual
gap satisfies $\Delta^d \le 1$ until the first that satisfies $\Delta^d \le \varepsilon$. Clearly, $K \le Q + K(\varepsilon)$,
where $Q$ is the number of iterations required to obtain an iterate satisfying
$\Delta^d \le 1$ (which is finite and independent of $\varepsilon$).

Now, it is not hard to see that $p = \lceil \log_2 1/\varepsilon \rceil - 1$ is the smallest positive integer
such that $1/2^p \le 2\varepsilon$. Therefore, we can bound $K(\varepsilon)$ as:

\[
K(\varepsilon) \le k(1) + k\!\left(\tfrac{1}{2}\right) + \cdots + k\!\left(\tfrac{1}{2^p}\right)
 \le 32L\left(1 + 2 + 4 + \cdots + 2^{\lceil \log_2 1/\varepsilon \rceil - 1}\right) \le \frac{64L}{\varepsilon} . \tag{60}
\]

Setting $M = 64L$ gives the result. □

* Note that the previous proof is based on the minimal improvement of a FW step. The
result holds in general because a SWAP step is performed if and only if the unconstrained
SWAP yields a larger improvement.

† We assume, for the sake of simplicity, that $\varepsilon < L/2$. Otherwise the proof can be adapted.
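
The geometric sum behind (60) is easy to evaluate numerically. The snippet below is a small sketch under the assumptions of the proof (each halving stage costs at most $32L/\delta$ iterations); the function name `halving_stage_bound` and the sample values of $L$ and $\varepsilon$ are illustrative only.

```python
import math


def halving_stage_bound(L, eps):
    """Sum of the per-stage bounds 32L/delta for delta = 1, 1/2, ..., 1/2^p,
    with p = ceil(log2(1/eps)) - 1, as in Eq. (60)."""
    p = math.ceil(math.log2(1.0 / eps)) - 1
    return sum(32.0 * L * 2 ** i for i in range(p + 1))


bound = halving_stage_bound(L=1.0, eps=0.01)
print(bound, 64.0 * 1.0 / 0.01)
```

For $L = 1$ and $\varepsilon = 0.01$ the stage index runs up to $p = 6$, and the sum stays below $64L/\varepsilon$, in line with the choice $M = 64L$.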

It is also possible to provide a logarithmic bound in $1/\varepsilon$. However, in this
case both the multiplicative and additive constants depend on $m$. Thus, from
such a result alone we cannot infer the important property that the overall
complexity of the algorithm can be bounded independently of the problem size.
Furthermore, if $m$ is comparable to or larger than $1/\varepsilon$ (which is often the case
in large-scale applications), there is no guarantee that the obtained bound is
tighter than the one given by Eqn. (58). The proof of this result, which we
state below for completeness, can be found in the Appendix.

Proposition 5. Let $K$ be the number of iterations performed by Algorithm 4
until the stopping condition of Eqn. (8) is fulfilled. Then, there exists $\varepsilon_0 > 0$
such that, if $\varepsilon < \varepsilon_0$,

\[ K \le Q + M \log\left(\frac{1}{\varepsilon}\right) , \tag{61} \]

where $Q$ and $M$ are constants independent of $\varepsilon$ but dependent on $m$. In
particular, $M \propto m$.



5 Experiments 

In this section we present several experiments conducted on benchmark clas- 
sification datasets to evaluate the performance of the proposed methods and 
related approaches in practice. 

Datasets The datasets used in this section are listed in Table 1 and can be
found in several public repositories [5J [TB] . In order to provide the reader with 
an idea of the size of each problem, we specify the size m of the training set, the 
number of features n, and the number of classes K. We denote by t the number 
of test examples, set aside to evaluate the expected accuracy of the computed 
classifier. 

In the case of multi-category classification problems, we adopt a one-versus- 
one approach (OVO) [Ijlj Note that in these cases the number of examples 

^This was the method used in [41] to extend the CVM beyond binary classification, and
according to [22] it usually outperforms other approaches both in terms of accuracy and
training time.
$m$ does not necessarily reflect the complexity of the training problems to be
addressed. For example, according to $m$, the MNIST and Web w8a datasets
have a similar size. However, the MNIST problem has 10 classes and the
largest binary problem to solve in the OVO scheme has around 13,000 training
examples. The Web w8a problem is, in contrast, binary, and thus the whole
dataset needs to be handled simultaneously. For this reason, we also report in
Table 1 the size $m_{max}$ of the largest binary subproblem and the size $m_{min}$ of the
smallest binary subproblem in the OVO decomposition.



Dataset            m         t    K     m_max     m_min      n
USPS            7291      2007   10      2199      1098    256
Pendigits       7494      3498   10      1560      1438     16
Letter         15000      5000   26      1213      1081     16
Protein        17766      6621    3     13701      9568    357
Shuttle        43500     14500    7     40856        17      9
IJCNN          49990     91701    2     49990     49990     22
MNIST          60000     10000   10     13007     11263    780
USPS-Ext      266079     75383    2    266079    266079    676
KDD-10pc      395216     98805    5    390901       976    127
KDD-Full     4898431    311029    2   4898431   4898431    127
Reuters         7770      3299    2      7770      7770   8315
Adult a1a       1605     30956    2      1605      1605    123
Adult a2a       2265     30296    2      2265      2265    123
Adult a3a       3185     29376    2      3185      3185    123
Adult a4a       4781     27780    2      4781      4781    123
Adult a5a       6414     26147    2      6414      6414    123
Adult a6a      11220     21341    2     11220     11220    123
Adult a7a      16100     16461    2     16100     16100    123
Adult a8a      22696      9865    2     22696     22696    123
Web w1a         2477     47272    2      2477      2477    300
Web w2a         3470     46279    2      3470      3470    300
Web w3a         4912     44837    2      4912      4912    300
Web w4a         7366     42383    2      7366      7366    300
Web w5a         9888     39861    2      9888      9888    300
Web w6a        17188     32561    2     17188     17188    300
Web w7a        24692     25057    2     24692     24692    300
Web w8a        49749     14951    2     49749     49749    300



Table 1: Features of the selected datasets. 
Initialization and Parameters For the initialization of the CVM, FW,
MFW and SWAP methods, that is, the computation of a starting solution,
we adopted the method proposed for the CVM in [41]. In this approach, the
starting solution is obtained by solving problem (2) on a random subset $\mathcal{X}_0$ of
$p$ training patterns. The indexes of $\alpha_0$ corresponding to other data points are
set to zero. We used p = 20 points for initialization and e = 10~^ with all the 
algorithms. 

In all but the last experiment described in this section, SVMs were trained
using an RBF (Gaussian) kernel

\[ k(x_1, x_2) = \exp\left(-\frac{\|x_1 - x_2\|^2}{2\sigma^2}\right) \tag{62} \]

with scale parameter $\sigma^2$. For the relatively small datasets Pendigits and
USPS, parameter u^ was determined together with parameter C of SVMs us- 
ing 10-fold cross-validation on the logarithmic grid [2~^^, 2^] x [2~^, 2^^], where 
the first collection of values corresponds to parameter $\sigma^2$ and the second to
parameter $C$.

For the large-scale datasets, $\sigma^2$ was determined using the default method
employed for CVM in [41], i.e. it was set to the average squared distance among
training patterns. Parameter C was determined on the logarithmic grid [2°, 2^^] 
using a validation set consisting of a randomly selected 30% of the training set.
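
The default scale heuristic described above can be sketched as follows. This is only an illustration: the helper names are hypothetical, the average is taken over distinct pairs of points (an assumption, since the text does not specify the convention), and the $2\sigma^2$ normalization of the kernel is likewise assumed rather than confirmed by the source. For large training sets the average would in practice be estimated on a subsample.

```python
import math


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def average_squared_distance(points):
    """Mean of ||x_i - x_j||^2 over all pairs i < j (assumed convention)."""
    n = len(points)
    total = sum(squared_distance(points[i], points[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)


def rbf_kernel(x1, x2, sigma2):
    """Gaussian kernel with scale sigma2 (2*sigma2 normalization assumed)."""
    return math.exp(-squared_distance(x1, x2) / (2.0 * sigma2))


# Three made-up training patterns.
points = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
sigma2 = average_squared_distance(points)
k01 = rbf_kernel(points[0], points[1], sigma2)
print(sigma2, k01)
```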

We emphasize that the aim of this paper is not to determine optimal parameter
values by fine-tuning each algorithm to seek the best possible accuracy.
Our aim is to compare the performance of the presented methods and analyze 
their behavior in a manner consistent with our theoretical analysis. Therefore 
it is necessary to perform the experiments under the same conditions on a given 
dataset. That is to say, the optimization problem to be solved should be the 
same for each algorithm. For this reason, we deliberately avoided using different 
training parameters when comparing different methods. Specifically, parameters 
$\sigma^2$ and $C$ were tuned using the CVM method and the obtained values were used
for all the algorithms discussed in this paper (CVM, FW, MFW and SWAP 
methods). 

Caching We also adopted the LRU caching strategy designed in [40] for the
CVM to avoid recomputing recently used kernel values.

Assessed Algorithms, Notation and Statistics In this paper we have 
introduced two variants of the FW method: the SWAP, and the second-order 
SWAP. The acronyms used to denote these algorithms in the figures will be 
SW and SW-2o, respectively. We will compare these methods against the CVM 
algorithm [41], the FW method and the MFW method.

In the next sections we report test accuracies, training times and model
sizes obtained on the classification problems of Table 1. By test accuracy we
mean the fraction of correctly classified test instances. Training time is the
time in seconds required to obtain a model from the training set. When times
differ by more than one order of magnitude among the different methods we
use a logarithmic scale to present figures. Model size is the number of training
examples with non-zero weights at the end of the training process, that is, the
number of support vectors in the model.

To obtain a more detailed comparison, we compute the speed-ups obtained
by the Frank-Wolfe based algorithms with respect to the CVM method. The
speed-up of the FW method with respect to CVM will be measured as $s_1 = t_0/t_1$,
where $t_0$ is the training time of the CVM algorithm and $t_1$ is the training time
of the FW method, both measured in seconds. Similarly, the speed-up of the
MFW, SWAP and SWAP-2o methods with respect to CVM is measured as
$s_2 = t_0/t_2$, $s_3 = t_0/t_3$, $s_4 = t_0/t_4$ respectively, where $t_2$ is the training time of
the MFW method, $t_3$ is that of SWAP, and $t_4$ that of SWAP-2o. In addition,
we quantify the difference in testing performance with respect to the CVM
method. If we denote by $a_0$ the accuracy of CVM and by $a_1$ the accuracy of
the FW method, the relative difference in accuracy incurred by FW will be
quantified as $d_1 = (a_0 - a_1)/a_0$. Similarly, differences in testing performance
corresponding to the methods MFW, SWAP and SWAP-2o will be measured
as $d_2 = (a_0 - a_2)/a_0$, $d_3 = (a_0 - a_3)/a_0$ and $d_4 = (a_0 - a_4)/a_0$, where $a_2$, $a_3$
and $a_4$ are the testing accuracies of the MFW, SWAP and SWAP-2o methods
respectively.
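
For concreteness, the two statistics defined above can be computed as in the following sketch. The function names are hypothetical and the timings and accuracies are made-up numbers used only to show the arithmetic; they are not results from the paper.

```python
def speed_up(t_cvm, t_method):
    """s_i = t_0 / t_i: how many times faster a method is than CVM."""
    return t_cvm / t_method


def relative_accuracy_difference(a_cvm, a_method):
    """d_i = (a_0 - a_i) / a_0: positive when the method is less accurate than CVM."""
    return (a_cvm - a_method) / a_cvm


# Made-up example: CVM takes 120 s, another method takes 4 s.
s_example = speed_up(120.0, 4.0)
d_example = relative_accuracy_difference(0.990, 0.985)
print(s_example, d_example)
```

A positive $d_i$ means a loss of accuracy relative to CVM, while a negative value means the method is more accurate than CVM.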

Computational Environment The experiments were conducted on a personal
computer with a 2.66 GHz Quad Core CPU and 4 GB of RAM, running
64-bit GNU/Linux. The algorithms were implemented based on the C++ source
code available at [ID]. 

5.1 Experiments on the Web Dataset Collection 

The Web Dataset Collection is a series of classification problems extracted from
a webpage categorization dataset, which first appeared in Platt's paper on Sequential
Minimal Optimization for training SVMs [29]. The number of training patterns
in each instance of the collection grows approximately as $m_i = 1.4^i m_0$,
$i = 1, \ldots, 8$, where $m_0$ is the number of training patterns in the first dataset. This
scheme makes the series amenable for studying the performance and scalability of
different training algorithms.

Figures 1(a), 2(a) and 3(a) report test accuracies, training times and model
sizes (number of support vectors) obtained in this collection. Note that times
are depicted in a logarithmic scale. From Figure 1(a) and Figure 2(a) we confirm
that all the Frank-Wolfe based methods are slightly less accurate than CVM but
exhibit running times that scale considerably better as the number of training
patterns increases. Each of them is faster than CVM in all the 8 datasets of the
collection.

Figure 2(a) illustrates one of the main points of this paper: the theoretical
advantages of the MFW method over the basic FW routine often do not
correspond to an improvement in practical performance. This collection of problems
is actually an extreme case, in which MFW is always significantly slower than
FW. In contrast, the proposed methods are faster than MFW and competitive
with the FW method.

From Figure 2(a) we can observe that the speed-ups of the FW method seem
to increase monotonically as the number of training patterns increases, ranging
from 12.6× faster up to ≈ 106× faster than CVM. Speed-ups corresponding
to the MFW method are in contrast significantly more limited. The SWAP
algorithm is clearly more competitive than MFW, with a speed-up of ≈ 250×
in the largest dataset.

Both MFW and SWAP endow the basic Frank-Wolfe procedure with away
steps, and both, in contrast to FW, offer a guarantee on the rate of
convergence. However, the away steps implemented by SWAP and SWAP-2o work
significantly better on this collection of datasets. SWAP-2o, however, does not
perform better than SWAP in this series. We argue that standard away steps do
not provide any significant advantage on this particular problem, as shown by
MFW being the slowest algorithm. Since SWAP-2o invests more time
in finding a good away direction, finding a solution takes more time in
comparison with the simpler SWAP, which seems to provide a better compromise
between away and toward steps.

As regards accuracy, MFW is slightly more accurate than SWAP, which in turn is
slightly more accurate than FW most of the time. SWAP-2o very often
outperforms the other three methods, approaching the accuracy of CVM. Note,
however, that the relative differences in testing accuracy are below 0.5% most
of the time. Note finally that FW is the least accurate among the Frank-Wolfe
based methods.

As concerns model sizes, note that the additional computational time incurred
by the MFW and SWAP-2o methods is not compensated by an improved ability to
find smaller models. Figure 3(a) actually shows that the two faster methods,
SWAP and FW, obtain smaller models most of the time. Finally, the models found
by CVM are significantly larger than those of the proposed methods. In
addition, the percentage of training data used by CVM to build the model does
not seem to decrease significantly as the series progresses.

5.2 Experiments on the Adult Dataset Collection 

The Adult Dataset Collection is a series of problems derived from the 1994 
US Census database. The goal is to predict whether an individual's income
exceeded US$50,000 per year, based on personal data. Like the Web datasets,
this collection was designed with the purpose of analyzing the scalability of
SVM methods. The number of training patterns grows at approximately the same
rate, i.e. it increases by a factor of approximately 1.4 each time [53].

Figures 1(b), 2(b) and 3(b) depict accuracies, running times and model sizes
(number of support vectors) obtained on this collection. Times are depicted in
a logarithmic scale. These results confirm that all the Frank-Wolfe based
methods tend to be faster than the CVM algorithm as the number of examples
becomes larger. Figure 2(b) shows that SWAP, MFW and SWAP-2o always run faster
than CVM, reaching speed-ups of 27x, 20x and 15x respectively. Figure 1 shows
in addition that most of the time the Frank-Wolfe based methods achieve a
testing performance greater than or equal to that of CVM.

Note that the speed-ups obtained by the FW method in this experiment are
significantly smaller than those obtained in the Web collection. The largest
speed-up achieved by the algorithm is 3.6x, on the sixth dataset of the
collection. In contrast, the methods investigated in this paper, SWAP and
SWAP-2o, always show speed-ups larger than 10x, running faster than FW in all
cases. If we compute the median speed-up among all the datasets of this
collection, the results for SWAP and SWAP-2o are 15.5x and 20.5x respectively.
In contrast, the FW method achieves a median of just 1.45x. We conclude that
the proposed methods are one order of magnitude faster than the basic FW
method in this experiment.

The previous remark suggests that away steps are very useful to speed up the
algorithm towards an optimal face in this problem. We confirm this observation
by examining the performance of the MFW method in this experiment. Figure 2(b)
shows that the MFW method is always faster than FW. This result contrasts with
our previous experiment, in which MFW was always slower than FW. We conclude
that in this experiment all the algorithms incorporating away steps are
significantly faster than the algorithms which do not. Note that the proposed
methods SWAP and SWAP-2o always run faster than MFW.

As regards testing accuracy, CVM is most of the time slightly less accurate
than the Frank-Wolfe methods in this experiment. SWAP always obtains an
accuracy greater than or equal to that of FW and, in all but one case, greater
than or equal to that of MFW. SWAP-2o is most of the time as accurate as MFW.
We conclude that the additional running time incurred by the CVM and FW
methods is not compensated by a better accuracy in this series of datasets.

Figure 3(b) shows that the model sizes obtained by the different methods are
quite similar.

5.3 Experiments on Other Medium-scale and Large-scale 
Datasets 

Figures 4 to 8 show the accuracies, times, speed-ups and model sizes obtained
on the other datasets of Table 1. A detailed description of these datasets can
be found in [14] or in the public repositories [8] and [16].

To simplify the presentation and further analysis, datasets were separated
into two groups: medium-scale and large-scale. A dataset was included in the
first group if the largest binary subproblem to be addressed (see column
m_max of Table 1) had fewer than 15,000 training examples, and in the second
group otherwise. According to this criterion, datasets Letter, Pendigits,
USPS, Reuters and MNIST were placed in the first group and datasets Shuttle,
IJCNN, USPS-Ext, KDD-10pc and KDD-Full in the second group. Results for the
Protein dataset are presented and analyzed separately because accuracies and
training times were significantly different from the other results in the
medium-scale group. Note again that most of the problems used in this
experiment have already been used to compare CVM against other algorithms to
train SVMs [H]. Times and model sizes are depicted in a logarithmic scale.

By examining Figure 4 we again observe a slight advantage of CVM in terms of
testing accuracy. In addition, we confirm that the accuracy of the SWAP and
SWAP-2o methods tends to be the closest to the best observed performance. The
FW method is very often the least accurate among the Frank-Wolfe based
algorithms. Note that the difference in accuracy with respect to CVM is always
lower than 2%.

Results in Figure 8 show that the FW, MFW, SWAP and SWAP-2o methods are most
of the time faster than CVM. The speed-up achieved by these methods becomes
more significant as the size of the training set grows, with peaks of around
100x and 25x on the largest datasets. Differences among the Frank-Wolfe
methods depend on the size of the problem. Among the medium-scale datasets all
the methods achieve running times of the same order of magnitude. Speed-ups in
the large-scale group are clearly more significant, with medians of 27.3x,
15.0x, 30.7x and 29.5x for FW, MFW, SWAP and SWAP-2o respectively.

The advantage of the methods explored in this paper over standard FW routines
can be summarized as follows. The FW and MFW methods can sometimes be faster
than SWAP and SWAP-2o, but in that case the advantage is very small. Often,
however, our methods improve on FW and MFW with more significant speed-ups.
MFW in particular tends to be significantly outperformed in the cases where FW
works better. In those cases the performance of our methods tends to be
competitive or better. On medium-scale problems all the methods are evenly
matched in performance, with a slight advantage for SWAP-2o and MFW. In the
large-scale group, SWAP and SWAP-2o tend to outperform FW and MFW more
significantly.

Results on the Protein dataset deserve a particular comment. This is a dataset
of around 18,000 examples distributed into 3 classes, which leads to binary
subproblems of around 10,000 examples. According to this size, the problem
should be included in the group of medium-scale datasets, on which we have seen
that the Frank-Wolfe algorithms obtain fairly similar and small speed-ups. In
the Protein problem, however, the methods obtain peculiar results. The FW
method achieves here a speed-up of 20.8x against CVM, whereas the standard MFW
runs 123.5x faster than CVM and 5.95x faster than FW. This suggests that in
this problem away steps significantly help the algorithm to find the solution
to the SVM problem more quickly. Since our methods tend to be better when away
steps work, we should observe important improvements over CVM using the
proposed methods. Indeed, the respective speed-ups for the SWAP and SWAP-2o
methods on this dataset are 157.3x and 358.0x. This means that SWAP runs 7.58x
faster than FW and 1.27x faster than MFW, while SWAP-2o runs 17.25x faster
than FW and 2.90x faster than MFW.
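
Because every speed-up above is measured against CVM on the same dataset, the
speed-up of one method over another is simply the ratio of their speed-ups
over CVM. The following minimal sketch recomputes these ratios from the four
values quoted in the text; the dictionary and function names are illustrative
only, and small discrepancies with the reported figures are to be expected
because the inputs are rounded.

```python
# Speed-ups over CVM on the Protein dataset, as quoted in the text.
speedup_vs_cvm = {"FW": 20.8, "MFW": 123.5, "SWAP": 157.3, "SWAP-2o": 358.0}


def relative_speedup(method, baseline):
    """How many times faster `method` is than `baseline`.

    Both entries are speed-ups over the same reference (CVM), so the
    reference running time cancels out in the ratio.
    """
    return speedup_vs_cvm[method] / speedup_vs_cvm[baseline]


if __name__ == "__main__":
    for method in ("MFW", "SWAP", "SWAP-2o"):
        for baseline in ("FW", "MFW"):
            if method != baseline:
                ratio = relative_speedup(method, baseline)
                print(f"{method} vs {baseline}: {ratio:.2f}x")
```

The ratios place SWAP-2o well ahead of SWAP with respect to both FW and MFW
on this dataset, in line with its larger speed-up over CVM.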

Note finally that Figure 6 suggests that there are no significant differences
among the sizes of the models built by the different methods.

5.4 Statistical Tests

In this section we perform some statistical tests to assess the significance
of the experimental results reported in this paper. To this end we adopt the
guidelines suggested in [10]. We first conduct a multiple test to determine
whether the hypothesis that all the algorithms perform equally can be
rejected. Then, we conduct separate binary tests to compare the performance of
each algorithm against each other. For the binary tests we adopt the Wilcoxon
Signed-Ranks Test. For the multiple test we use the non-parametric Friedman
Test. In [10], Demšar recommends these tests as safe alternatives to the
classical parametric t-tests to compare classifiers over multiple datasets.
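
To make the binary test concrete, the following sketch computes an exact
one-sided p-value of the Wilcoxon signed-ranks statistic by enumerating all
sign patterns, which is feasible for a small number of datasets. It is an
illustration under stated assumptions, not the procedure used for the reported
results: zero differences are simply dropped (Wilcoxon's original rule, not
the Pratt method), tied absolute differences receive average ranks, and the
function name and the sample running times are hypothetical.

```python
from itertools import product


def signed_rank_exact(x, y):
    """Exact one-sided signed-rank test of H1: x tends to be smaller than y.

    Returns (W+, p), where W+ is the sum of the ranks of the positive
    differences x_i - y_i and p = P(W+ <= observed) under H0.
    Assumption: zero differences are discarded (not Pratt's treatment).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend the block of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the ranks i+1, ..., j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under H0 each of the 2^n sign patterns is equally likely.
    hits = 0
    for signs in product((0, 1), repeat=n):
        if sum(r for r, s in zip(ranks, signs) if s) <= w_plus:
            hits += 1
    return w_plus, hits / 2 ** n


if __name__ == "__main__":
    # Hypothetical running times (seconds) of two methods on five datasets.
    times_a = [1.0, 2.0, 3.0, 4.0, 5.0]
    times_b = [2.5, 4.5, 6.5, 8.5, 10.5]
    print(signed_rank_exact(times_a, times_b))
```

With only five datasets and the first method faster on all of them, the
smallest attainable one-sided p-value is 1/32, which illustrates why exact
p-values matter when the number of datasets is small.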

From the multiple test, we conclude that there is indeed a statistically
significant difference among the running times and accuracies of all the
algorithms (p-values were lower than 0.001 in both cases).

We then conduct a binary test on each pair of algorithms. The main hy- 
pothesis of this paper is that the SWAP method outperforms the MFW and 
FW methods in terms of training time without significant differences in terms 
of predictive accuracy. In contrast, we claim that no significant differences be- 
tween the MFW and FW methods are observed in practice (although MFW 
seems to be slightly more accurate). We have also observed that the SWAP 
method significantly outperforms CVM, sometimes at the expense of a little 
test accuracy. Finally, we have observed that the SWAP-2o usually exhibits 
larger running times than the SWAP method but outperforms the other FW 
based methods in terms of predictive power. As regards the comparison of the 
proposed methods, there is no apparent advantage in terms of running time of 
one against the other. We thus conduct a two-tailed test for the running times 
but adopt a one-tailed test for testing accuracy. Considering all the observations 
above, our design for the binary tests is that of Table 2.

In Table 2 we also report the p-values corresponding to each test.¹ For
reproducibility, p-values were computed using the statistical software R
[5U|. For the Wilcoxon Signed-Ranks Test, the exact p-values were preferred to
the asymptotic ones. The Pratt method to handle ties is employed by default.
In the case of the Friedman test, the Iman and Davenport correction was
adopted, as suggested in [10].
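
As an illustration of the multiple test, the sketch below computes the
Friedman statistic from average ranks and applies the Iman and Davenport
correction in the form given by Demšar: with N datasets and k algorithms,
chi2_F = 12N/(k(k+1)) * (sum_j R_j^2 - k(k+1)^2/4) and
F_F = (N-1) chi2_F / (N(k-1) - chi2_F), which is compared against an F
distribution with k-1 and (k-1)(N-1) degrees of freedom. The function names
and the data are hypothetical, and the p-value itself is not computed here
because that requires the F distribution (available, for instance, in R).

```python
def average_ranks(rows):
    """Average rank of each algorithm; rows[i][j] is the measurement
    (lower is better) of algorithm j on dataset i. Ties share average ranks."""
    k = len(rows[0])
    totals = [0.0] * k
    for row in rows:
        order = sorted(range(k), key=lambda j: row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            avg = (pos + end) / 2 + 1  # average of ranks pos+1, ..., end+1
            for t in range(pos, end + 1):
                totals[order[t]] += avg
            pos = end + 1
    return [t / len(rows) for t in totals]


def friedman_iman_davenport(rows):
    """Return (chi2_F, F_F, (df1, df2)) for the given measurements.

    Note: F_F is undefined when all datasets rank the algorithms
    identically, since the denominator N(k-1) - chi2_F is then zero.
    """
    n, k = len(rows), len(rows[0])
    r = average_ranks(rows)
    chi2 = 12 * n / (k * (k + 1)) * (sum(v * v for v in r) - k * (k + 1) ** 2 / 4)
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, f_stat, (k - 1, (k - 1) * (n - 1))


if __name__ == "__main__":
    # Hypothetical running times of 3 algorithms on 4 datasets.
    times = [
        [1.0, 2.0, 3.0],
        [1.5, 2.5, 2.0],
        [0.9, 1.8, 2.7],
        [1.1, 3.0, 2.0],
    ]
    print(friedman_iman_davenport(times))
```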

We now point out some of the conclusions which can be drawn from Table 2. At
commonly used significance levels (10%, 5%, 1% or lower), the hypothesis that
FW and MFW are equally fast cannot be rejected. Adopting a significance level
of 5%, the running times of the SWAP method are found to be significantly
different from those of all the baseline methods (FW, MFW and CVM), so the
null hypothesis is rejected in favor of the alternative hypothesis that the
SWAP method is faster. At the same significance level, or better, the
hypotheses that the SWAP-2o method is as fast as MFW or CVM are rejected in
favor of



¹ In some cases we adopt one-sided alternative hypotheses and in others
two-sided ones. If a two-sided test is preferred to a one-sided one, it is
enough to double the p-value reported here. Conversely, if a one-sided test is
preferred to a two-sided one, it is enough to halve the p-value reported here.
Table 2: Null hypotheses, alternative hypotheses and p-values for the binary
statistical tests. The conclusion of the test adopting a significance level of 5%
is highlighted in blue.

Comparison         Measure    H0                      H1                       p-value
SWAP vs. FW        Time       Both equally fast       SWAP faster              0.0376
SWAP vs. FW        Accuracy   Both equally accurate   Different accuracies     0.2389
SWAP vs. MFW       Time       Both equally fast       SWAP faster              f.53e-05
SWAP vs. MFW       Accuracy   Both equally accurate   Different accuracies     0.10f9
SWAP vs. CVM       Time       Both equally fast       SWAP faster              5.53e-06
SWAP vs. CVM       Accuracy   Both equally accurate   CVM more accurate        1.87e-04
FW vs. MFW         Time       Both equally fast       Different times          0.6893
FW vs. MFW         Accuracy   Both equally accurate   Different accuracies     0.ff2
SWAP-2o vs. FW     Time       Both equally fast       Different times          0.3403
SWAP-2o vs. FW     Accuracy   Both equally accurate   SWAP-2o more accurate    0.0f07
SWAP-2o vs. MFW    Time       Both equally fast       SWAP-2o faster           f.09e-04
SWAP-2o vs. MFW    Accuracy   Both equally accurate   SWAP-2o more accurate    0.0f63
SWAP-2o vs. CVM    Time       Both equally fast       SWAP-2o faster           4.47e-08
SWAP-2o vs. CVM    Accuracy   Both equally accurate   CVM more accurate        2.25e-04
SWAP-2o vs. SWAP   Time       Both equally fast       SWAP faster              0.f901
SWAP-2o vs. SWAP   Accuracy   Both equally accurate   SWAP-2o more accurate    1.42e-03


the conclusion that the SWAP-2o method is faster. Empirical data is, however,
insufficient to reject the hypothesis that the SWAP-2o method is as fast as
the FW or the SWAP methods. As regards the testing accuracy, FW, MFW and SWAP
are found to be equally accurate at reasonable significance levels (10%, 5%,
1% or lower). In contrast, the hypothesis that the SWAP-2o method has similar
accuracies to FW, MFW and SWAP is rejected in favor of the conclusion that
SWAP-2o is more accurate.²

5.5 Experiments with Non-Normalized Kernels 

Solving a classification problem using SVMs requires selecting a kernel function.
Since the optimal kernel for a given application cannot be specified a priori, the 
capability of a training method to work with any (or the widest possible) family 
of kernels is an important feature. 



² Note: we decided to exclude the results on the a8a dataset from this
statistical analysis, because one of the baseline methods (CVM) did not
terminate in a reasonable time.

Figure 1: Testing accuracies in the Web and Adult collections. Panels: (a) Web
collection; (b) Adult collection. Methods shown: CVM, FW, MFW, SWAP, SWAP-2o.
Horizontal axis: accuracy.

Figure 2: Running times in the Web and Adult collections. The column on the left
shows speed-ups with respect to CVM. Panels: (a) Web collection; (b) Adult
collection.

Figure 3: Model sizes in the Web and Adult collections, ... shows the
percentage of the total number of examples. Panels include (b) Adult
collection.

Figure 4: On the left, testing accuracies in the medium-scale datasets: Letter,
Pendigits, USPS, Reuters, MNIST. On the right, testing accuracies in the large
dataset collection: Shuttle, IJCNN, USPS-Ext, KDD-10pc, KDD-Full. Panels: (a)
Medium-scale datasets; (b) Large-scale datasets.

Figure 5: On the left, testing accuracies in the Protein dataset. On the right,
running times in the Protein dataset.

Figure 6: On the left, model sizes in the medium-scale datasets: Letter,
Pendigits, USPS, Reuters, MNIST. On the right, model sizes in the large dataset
collection: Shuttle, IJCNN, USPS-Ext, KDD-10pc, KDD-Full. Panels: (a)
Medium-scale datasets; (b) Large-scale datasets.

Figure 8: On the left, running times in the medium-scale datasets: Letter,
Pendigits, USPS, Reuters, MNIST. On the right, running times in the large
dataset collection: Shuttle, IJCNN, USPS-Ext, KDD-10pc, KDD-Full. Panels
include (a) Medium-scale datasets.

In order to illustrate that the proposed methods can obtain effective models
even if the kernel does not satisfy the conditions required by CVM, we conduct
experiments using the homogeneous second-order polynomial kernel
$k(x_i, x_j) = (\gamma\, x_i^T x_j)^2$. Here, the parameter $\gamma$ is
estimated as the inverse of the average squared distance among training
patterns [H].
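
The kernel and the heuristic for its parameter can be sketched as follows. This
is only an illustration: the function names and the toy points are
hypothetical, and the average is taken over all distinct pairs of points,
which is one reasonable reading of "average squared distance among training
patterns". The example also shows that k(x, x) varies from point to point,
which is presumably the kind of behavior that a normalization condition on
the kernel excludes.

```python
def mean_squared_distance(points):
    """Average squared Euclidean distance over all distinct pairs of points."""
    n = len(points)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            pairs += 1
    return total / pairs


def poly2_kernel(x, y, gamma):
    """Homogeneous second-order polynomial kernel (gamma * <x, y>)^2."""
    return (gamma * sum(a * b for a, b in zip(x, y))) ** 2


if __name__ == "__main__":
    # Hypothetical training patterns in the plane.
    points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
    gamma = 1.0 / mean_squared_distance(points)
    for p in points:
        # The diagonal values k(x, x) are not constant for this kernel.
        print(p, poly2_kernel(p, p, gamma))
```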

Figures 9 and 10 summarize the results obtained on some of the datasets used
in this section. We can see that both test accuracies and training times are
comparable to those obtained using the Gaussian kernel. It should be noted
that the CVM algorithm cannot be used to train an SVM using the kernel
selected for this experiment, and thus we only include the Frank-Wolfe based
methods in the figures. These results demonstrate the capability of our
methods to be used with kernels other than those satisfying the normalization
condition imposed by CVM.

Figure 9: On the left, testing accuracies obtained with the polynomial kernel
in the datasets of the Web collection, w1a, w2a, w3a, w4a, w5a, w6a, w7a and
w8a. On the right, the corresponding running times.

Figure 10: On the left, model sizes obtained with the polynomial kernel in the
datasets of the Web collection. On the right, testing accuracies, running
times and coreset sizes obtained in the Shuttle and Reuters datasets.

6 Conclusions

The main contribution of this paper is twofold. On the theoretical side, we pro- 
posed a new variant of the FW method for the general problem of maximizing a 
concave function on the unit simplex, introducing a novel way to perform away 
steps in the FW method devised to boost its convergence. On the practical side, 
we demonstrated that our approach is very effective in improving the perfor- 
mance of state-of-the-art SVM learners for large datasets, further expanding on 
the research about FW methods for Machine Learning problems. 

We presented two variants of the procedure, SWAP and SWAP-2o, for which 
we provided a thorough theoretical analysis. First, we demonstrated that they 
converge globally. Second, we showed that SWAP and SWAP-2o asymptotically 
exhibit a linear rate of convergence, which is, as in the case of the MFW method, 
the main additional property with respect to the standard FW method. Finally, 
we proved that they achieve a primal-dual gap lower than a given tolerance
$\varepsilon$ in $O(1/\varepsilon)$ iterations, independently of m, the
dimensionality of the feasible space and the number of examples in SVM
problems.

We then carried out an extensive set of performance evaluation experiments 
for both variants of the algorithm. The obtained results demonstrated that, 
in contrast to the MFW method, our approach provides a useful and robust 
alternative to the FW method for training SVMs. 

Most often, the proposed methods SWAP and SWAP-2o improved on the performance
of MFW. The SWAP method was faster than MFW on all the datasets of the Adult
collection, the Web collection and the Protein problem. In the large-scale
group of Figure 8(b) SWAP outperformed MFW on 4 (out of 5) datasets. In the
medium-scale problems of Figure 8(a) SWAP was slightly slower.

The SWAP-2o method was faster than MFW on all the datasets of the Adult
collection, 6 (out of 8) datasets in the Web collection and 4 (out of 5)
datasets in the large-scale group of Figure 8(b). SWAP-2o was also faster in
the Protein problem and slightly faster on the medium-scale problems of
Figure 8(a).

The conclusion that SWAP and SWAP-2o are faster than MFW was found 
statistically significant at significance levels of 1% or better. Often, the SWAP 
method improved on MFW by one order of magnitude and sometimes by two 
orders of magnitude. In addition, in the cases in which MFW was faster, the ad- 
vantage was less significant than the improvements of our techniques on MFW. 

The proposed methods were also faster than the basic FW method in several
cases. For example, the median speed-up of SWAP over FW in the Adult
collection was 15x, and that of SWAP-2o was 20x. Similar results were observed in
the Shuttle and Protein datasets. We found that the conclusion that SWAP 
is faster than FW is statistically significant at a critical value of around 4%. In 
contrast, we were not able to reject the hypothesis that MFW and FW lead to 
similar training times. Similarly, we cannot conclude that FW and SWAP-2o 
have different running times. 

Another important conclusion of our experimental results arises after an
analysis of the cases in which either FW or MFW fails to improve the running
times of CVM by a significant amount.

• In some cases, the away steps of MFW significantly speed up the FW method.
Examples are the Adult collection and the Shuttle and Protein datasets. In
those cases, the SWAP method is competitive with or faster than MFW and
significantly faster than FW.

• In some other cases, classic away steps fail, and MFW achieves notably worse
running times. For instance, we observed this behavior in the Web collection
and in the USPS-Ext and KDD-10pc datasets. In those cases, the SWAP method is
clearly faster than MFW. In addition, it is competitive with the fastest
algorithm (FW).

We conclude that the SWAP method can be expected to be faster than MFW 
in those cases in which classic away steps effectively boost the convergence of 
the FW method but also very competitive against FW when away steps fail. 
Thus, SWAP is a robust alternative to FW, MFW or CVM. From this point 
of view, the SWAP-2o method is less appealing. Even though SWAP-2o outperforms
the standard FW by a wider margin when away steps are useful, this technique
seems to fail very often in the same cases in which MFW fails. If we knew that
away steps were going to be useful for a given problem, SWAP-2o would be the 
algorithm of choice. However, since we cannot predict that in advance, MFW 
and SWAP-2o are less reliable in practice. 

Finally, our experiments have demonstrated that the improvements in running
time that we obtain over FW or MFW do not come at the expense of testing
accuracy. Most of the time SWAP is slightly more accurate than FW and as
accurate as MFW.



A Technical Results 

Here we report statements and proofs of a number of technical results, some of 
which are used in the theoretical analysis of Section 4. 

A.1 Perturbation Analysis

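We follow the analysis presented in [1], which is in turn based on the
perturbation method of Robinson [32]. Consider the following perturbed variant
of the original problem,

$$
\begin{aligned}
\text{maximize} \quad & w(\alpha) = g(\alpha) - z^T \alpha \\
\text{subject to} \quad & \mathbf{1}^T \alpha = 1, \;\; \alpha \geq 0 ,
\end{aligned}
\tag{63}
$$

where $z \in \mathbb{R}^m$ is a perturbation vector.

Now, suppose we have a $\Delta_*$-approximate solution $\alpha_* \in
\mathbb{R}^m$. We aim to show that $\alpha_*$ is the solution of a perturbed
problem with a certain $z$. We define $z = z(\alpha_*, \Delta_*)$ by

$$
z_i =
\begin{cases}
\Delta_* & \text{if } \alpha_{*,i} = 0 , \\
\nabla g(\alpha_*)_i - \alpha_*^T \nabla g(\alpha_*) & \text{if } \alpha_{*,i} > 0 .
\end{cases}
\tag{64}
$$

Note first that if $\alpha_{*,i} > 0$,

$$
\begin{aligned}
z_i &= \nabla g(\alpha_*)_i - \alpha_*^T \nabla g(\alpha_*) \geq -\Delta_* , \\
z_i &= \nabla g(\alpha_*)_i - \alpha_*^T \nabla g(\alpha_*) \leq +\Delta_* ,
\end{aligned}
\tag{65}
$$

because $\alpha_*$ is a $\Delta_*$-approximate solution. If $\alpha_{*,i} = 0$,
$z_i = \Delta_*$ by construction. Then,

$$
\|z\|^2 = \sum_i |z_i|^2 \leq m \Delta_*^2 . \tag{66}
$$

Note in addition that

$$
\alpha_*^T z = \sum_i \alpha_{*,i} z_i = \sum_{i : \alpha_{*,i} > 0} \alpha_{*,i} z_i
= \alpha_*^T \nabla g(\alpha_*) - \left( \alpha_*^T \nabla g(\alpha_*) \right) \alpha_*^T \mathbf{1} = 0 ,
\tag{67}
$$

because $\alpha_*$ is feasible for the original problem. Note finally that

$$
\nabla w(\alpha) = \nabla g(\alpha) - z . \tag{68}
$$

Thus, from Eqn. (67) we obtain that the following stationarity condition is
fulfilled:

$$
\alpha_*^T \nabla w(\alpha_*) = \alpha_*^T \nabla g(\alpha_*) - \alpha_*^T z = \alpha_*^T \nabla g(\alpha_*) . \tag{69}
$$

The following lemma follows easily from the previous remarks.

Lemma 3. If $\alpha_*$ is a $\Delta_*$-approximate solution, then $\alpha_*$ is
optimal for problem (63) with $z = z(\alpha_*, \Delta_*)$ as defined in
Eqn. (64).

Proof. It follows from the concavity of problem (63) and the remarks above.
See [1], Lemma 3.3 for details. □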
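
The construction of the perturbation vector can be checked numerically on a
small example. The sketch below is illustrative only: the function name, the
point alpha and the gradient values are hypothetical, and the definition of a
Delta-approximate solution is assumed to be the one implied by Eqn. (65) and
the remarks that follow it (gradient components exceed alpha^T grad by at most
Delta, and fall short of it by at most Delta on the support of alpha).

```python
def perturbation_vector(alpha, grad, delta):
    """Vector z as in Eqn. (64): delta on zero coordinates of alpha,
    and grad_i - alpha^T grad on the coordinates where alpha_i > 0."""
    avg = sum(a * g for a, g in zip(alpha, grad))  # alpha^T grad
    return [delta if a == 0 else g - avg for a, g in zip(alpha, grad)]


if __name__ == "__main__":
    alpha = [0.5, 0.5, 0.0]   # hypothetical feasible point on the simplex
    grad = [1.0, 0.9, 0.8]    # hypothetical gradient of g at alpha
    delta = 0.2
    z = perturbation_vector(alpha, grad, delta)

    # Eqn. (67): alpha^T z = 0 (up to rounding).
    print(sum(a * zi for a, zi in zip(alpha, z)))
    # Eqn. (66): ||z||^2 <= m * delta^2.
    print(sum(zi * zi for zi in z), len(z) * delta ** 2)
    # Gradient of the perturbed objective, Eqn. (68): no component exceeds
    # alpha^T (grad - z), which is the first-order optimality condition
    # for maximizing a concave function on the simplex.
    print([g - zi for g, zi in zip(grad, z)])
```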

The next lemma is the basis of the analysis of the rate of convergence for
the modified Frank-Wolfe methods.

Lemma 4. Let $\alpha^*$ be the solution of the original problem and $\alpha_*$
a $\Delta_*$-approximate solution. Then,

$$
g(\alpha^*) - g(\alpha_*) \leq \|z\| \, \|\alpha^* - \alpha_*\| \leq \sqrt{m}\, \Delta_* \|\alpha^* - \alpha_*\| . \tag{70}
$$

Proof. The vector $\alpha^*$ is feasible for the perturbed problem (63) with
$z = z(\alpha_*, \Delta_*)$. Since $\alpha_*$ is optimal for this problem, we
have $g(\alpha^*) - z^T \alpha^* \leq g(\alpha_*) - z^T \alpha_*$. Together
with the Cauchy-Schwarz inequality, this demonstrates the first inequality.
The other follows from Eqn. (66). □

From here, the following lemma follows easily.

Lemma 5. Let $\alpha^*$ be the solution of the original problem and $\alpha_*$
a $\Delta_*$-approximate solution, where $\Delta_*$ is sufficiently small. Then

$$
g(\alpha^*) - g(\alpha_*) \leq N m \Delta_*^2 , \tag{71}
$$

for some Lipschitz constant $N$.

Proof. See [1] to see how from the Robinson condition it follows that there
exists a Lipschitz constant $N$ such that, for sufficiently small $\Delta_*$,
$\|\alpha^* - \alpha_*\| \leq N \|z\| \leq N \sqrt{m}\, \Delta_*$. Combining
this result with the previous lemma yields the result. □



Now, since $g$ is twice differentiable, the Taylor expansion for
$g(\alpha_k + \lambda d)$ as a function of $\lambda$ is

$$
g(\alpha_k + \lambda d) = g(\alpha_k) + \lambda \nabla g(\alpha_k)^T d + \frac{1}{2} \lambda^2 d^T \nabla^2 g(\bar{\alpha})\, d , \tag{72}
$$

where $\bar{\alpha}$ is some point on the segment between $\alpha_k + \lambda d$
and $\alpha_k$. Since $g$ is concave, the Hessian matrix
$\nabla^2 g(\bar{\alpha})$ is negative semi-definite, so the last term is
always non-positive. To obtain a bound for
$g(\alpha_k + \lambda d) - g(\alpha_k)$ we need a bound $L$ on the norm of
$\nabla^2 g(\alpha)$ over the simplex. We can set $L$ to the largest absolute
value of an eigenvalue of this matrix. We therefore obtain the following bound:

$$
g(\alpha_k + \lambda d) - g(\alpha_k) \geq \lambda \nabla g(\alpha_k)^T d - \frac{1}{2} \lambda^2 L \|d\|^2 . \tag{73}
$$

We now exploit the previous bound to analyze the improvement in the objective
function $\delta_{fw}$ after a standard Frank-Wolfe (FW) step, and the
improvement $\delta_{swap}$ after a SWAP step in Algorithm 4.

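A.2 Objective Function Improvement after Frank-Wolfe Steps

For a FW step we have $d = e_{i^*} - \alpha_k$. Thus,

$$
\begin{aligned}
\delta_{fw} &= g(\alpha_k + \lambda (e_{i^*} - \alpha_k)) - g(\alpha_k) \\
&\geq \lambda \left( \nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)^T \alpha_k \right) - \frac{\lambda^2}{2} L \|e_{i^*} - \alpha_k\|^2 .
\end{aligned}
\tag{74}
$$

But both $e_{i^*}$ and $\alpha_k$ lie in the simplex. Hence
$\|e_{i^*} - \alpha_k\|^2 \leq 2$. This leads to

$$
\begin{aligned}
\delta_{fw} &= g(\alpha_k + \lambda (e_{i^*} - \alpha_k)) - g(\alpha_k) \\
&\geq \lambda \left( \nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)^T \alpha_k \right) - L \lambda^2 .
\end{aligned}
\tag{75}
$$

The maximum of the right-hand side is obtained for

$$
\lambda^*_{fw} = \frac{\nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)^T \alpha_k}{2L} . \tag{76}
$$

If $\lambda^*_{fw} \leq 1$, the improvement in the objective function after an
iteration marked as a FW step in Algorithm 4 is bounded by

$$
\delta_{fw} \geq \frac{\left( \nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)^T \alpha_k \right)^2}{4L} , \tag{77}
$$

and by reordering we obtain

$$
\nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)^T \alpha_k \leq 2\sqrt{L \delta_{fw}} \leq 2\sqrt{L \delta_k} , \tag{78}
$$

where the latter inequality follows from the definition
$\delta_k = \max(\delta_{fw}, \delta_{swap})$. Now, if $\lambda^*_{fw} > 1$, we
cannot use this step-size. In that case we use the step-size $\lambda = 1$.
But $\lambda^*_{fw} > 1$ implies

$$
\nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)^T \alpha_k > 2L . \tag{79}
$$

Thus, the improvement in the objective function for a FW step can be bounded
in this case as (use Eqn. (75) with $\lambda = 1$ and exploit the inequality
above)

$$
\delta_{fw} \geq \frac{\nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)^T \alpha_k}{2} , \tag{80}
$$

which leads to

$$
\nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)^T \alpha_k \leq 2 \delta_{fw} \leq 2 \delta_k . \tag{81}
$$

In any case, we have the following bound for the improvement of the objective
function

$$
\delta_{fw} \geq \min \left( \frac{\left( \nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)^T \alpha_k \right)^2}{4L} ,\; \frac{\nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)^T \alpha_k}{2} \right) , \tag{82}
$$

which guarantees that for any $k$

$$
\nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)^T \alpha_k \leq \max \left( 2\sqrt{L \delta_{fw}} ,\, 2\delta_{fw} \right) \leq \max \left( 2\sqrt{L \delta_k} ,\, 2\delta_k \right) . \tag{83}
$$

Now, since $\nabla g(\alpha_k)_i \leq \nabla g(\alpha_k)_{i^*}$ for all $i$,
the following inequality is guaranteed at each iteration of Algorithm 4 for
any $i$:

$$
\nabla g(\alpha_k)_i - \alpha_k^T \nabla g(\alpha_k) \leq \max \left( 2\sqrt{L \delta_k} ,\, 2\delta_k \right) . \tag{84}
$$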
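
The bound (82) can be checked numerically on a toy problem. The following
sketch is an illustration under stated assumptions, not the authors'
implementation: the objective is a hypothetical concave quadratic
g(alpha) = -1/2 * sum_i K_ii alpha_i^2 with a diagonal matrix K, so that L is
simply the largest diagonal entry, and the step-size is the value of Eqn. (76)
clipped to 1 rather than an exact line search.

```python
def g(alpha, diag):
    """Hypothetical concave objective: -1/2 * sum_i diag[i] * alpha[i]^2."""
    return -0.5 * sum(k * a * a for k, a in zip(diag, alpha))


def grad_g(alpha, diag):
    return [-k * a for k, a in zip(diag, alpha)]


def fw_step(alpha, diag):
    """One Frank-Wolfe step with step-size min(1, gap / (2L)), as in Eqn. (76).

    Returns the new iterate, the gap grad_{i*} - grad^T alpha and the
    guaranteed improvement min(gap^2 / (4L), gap / 2) of Eqn. (82).
    """
    L = max(diag)  # largest absolute eigenvalue of the (diagonal) Hessian
    grad = grad_g(alpha, diag)
    i_star = max(range(len(alpha)), key=lambda i: grad[i])
    gap = grad[i_star] - sum(a * gi for a, gi in zip(alpha, grad))
    lam = min(1.0, gap / (2 * L))
    new_alpha = [(1 - lam) * a + (lam if i == i_star else 0.0)
                 for i, a in enumerate(alpha)]
    bound = min(gap ** 2 / (4 * L), gap / 2)
    return new_alpha, gap, bound


if __name__ == "__main__":
    diag = [2.0, 1.0, 4.0]
    alpha = [1.0, 0.0, 0.0]
    new_alpha, gap, bound = fw_step(alpha, diag)
    improvement = g(new_alpha, diag) - g(alpha, diag)
    print(new_alpha, gap, improvement, bound)
```

In this example the observed improvement exceeds the guaranteed one, as
expected, because the bound replaces the true curvature along the step by the
worst-case constant L.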

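A.3 Objective Function Improvement after SWAP Steps

We now bound the improvement obtained by SWAP steps. In this case,
$d = e_{i^*} - e_{j^*}$, and

$$
\begin{aligned}
\delta_{swap} &= g(\alpha_k + \lambda (e_{i^*} - e_{j^*})) - g(\alpha_k) \\
&\geq \lambda \left( \nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)_{j^*} \right) - \frac{\lambda^2}{2} L \|e_{i^*} - e_{j^*}\|^2 .
\end{aligned}
\tag{85}
$$

But $\|e_{i^*} - e_{j^*}\|^2 = 2$. Thus,

$$
\delta_{swap} \geq \lambda \left( \nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)_{j^*} \right) - \lambda^2 L . \tag{86}
$$

The maximum of the right-hand side is obtained for

$$
\lambda^*_{swap} = \frac{\nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)_{j^*}}{2L} . \tag{87}
$$

If $\lambda^*_{swap} \leq 1$, the improvement in the objective function for an
unconstrained SWAP step, that is, an iteration marked as SWAP-add in
Algorithm 4, is bounded as

$$
\delta_{swap} \geq \frac{\left( \nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)_{j^*} \right)^2}{4L} . \tag{88}
$$

Note now that $\alpha_k^T \nabla g(\alpha_k) \leq \nabla g(\alpha_k)_{i^*}$
because $\nabla g(\alpha_k)_{i^*} = \max_i \nabla g(\alpha_k)_i$ and
$\alpha_k^T \mathbf{1} = 1$. This observation leads to

$$
\delta_{swap} \geq \frac{\left( \alpha_k^T \nabla g(\alpha_k) - \nabla g(\alpha_k)_{j^*} \right)^2}{4L} . \tag{89}
$$

By reordering, we obtain the following inequality,

$$
\alpha_k^T \nabla g(\alpha_k) - \nabla g(\alpha_k)_{j^*} \leq 2\sqrt{L \delta_{swap}} \leq 2\sqrt{L \delta_k} , \tag{90}
$$

where we have used the definition of $\delta_k$. Now, if
$\lambda^*_{swap} > 1$, we cannot use this step-size. In that case we use the
step-size $\lambda = 1$. Recall that we are assuming that a SWAP-add step is
performed. In a way analogous to the Frank-Wolfe case,
$\lambda^*_{swap} > 1$ implies

$$
\nabla g(\alpha_k)_{i^*} - \nabla g(\alpha_k)_{j^*} > 2L . \tag{91}
$$

Thus, the improvement in the objective function is bounded in this case by

$$
\delta_{swap} \geq \frac{\alpha_k^T \nabla g(\alpha_k) - \nabla g(\alpha_k)_{j^*}}{2} , \tag{92}
$$

which leads to

$$
\alpha_k^T \nabla g(\alpha_k) - \nabla g(\alpha_k)_{j^*} \leq 2 \delta_{swap} \leq 2 \delta_k . \tag{93}
$$

In any case, we have the following bound for the SWAP case

$$
\delta_{swap} \geq \min \left( \frac{\left( \alpha_k^T \nabla g(\alpha_k) - \nabla g(\alpha_k)_{j^*} \right)^2}{4L} ,\; \frac{\alpha_k^T \nabla g(\alpha_k) - \nabla g(\alpha_k)_{j^*}}{2} \right) , \tag{94}
$$

which leads to

$$
\alpha_k^T \nabla g(\alpha_k) - \nabla g(\alpha_k)_{j^*} \leq \max \left( 2\sqrt{L \delta_{swap}} ,\, 2\delta_{swap} \right) \leq \max \left( 2\sqrt{L \delta_k} ,\, 2\delta_k \right) . \tag{95}
$$

Note now that the definition of $j^*$ can be rearranged as

$$
\begin{aligned}
j^* &= \arg\min_{j : \alpha_{k,j} > 0} \; \nabla g(\alpha_k)_j - \alpha_k^T \nabla g(\alpha_k) \\
    &= \arg\max_{j : \alpha_{k,j} > 0} \; \alpha_k^T \nabla g(\alpha_k) - \nabla g(\alpha_k)_j .
\end{aligned}
\tag{96}
$$

Thus, we obtain that the following inequality is guaranteed at each iteration
of Algorithm 4 for all $i$ such that $\alpha_{k,i} > 0$:

$$
\alpha_k^T \nabla g(\alpha_k) - \nabla g(\alpha_k)_i \leq \max \left( 2\sqrt{L \delta_k} ,\, 2\delta_k \right) . \tag{97}
$$

Remark 1. After a SWAP-drop step in Algorithm 4 we cannot bound the
improvement in the objective function, because the clipped value of the
step-size $\lambda^*_{swap}$ may be arbitrarily small. However, it is not hard
to show that the objective function value does not decrease.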
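
A single SWAP step along $d = e_{i^*} - e_{j^*}$ can be sketched in the same
way. The code below is an assumption-laden illustration, not Algorithm 4
itself: it uses a hypothetical concave quadratic with diagonal Hessian, takes
$i^*$ as the coordinate with the largest gradient and $j^*$ as the active
coordinate with the smallest gradient, uses the step-size of Eqn. (87) clipped
to 1, and labels the step "SWAP-drop" when the step-size has to be clipped
further to keep the weight of $j^*$ non-negative. The rule that decides
between FW and SWAP steps in the full algorithm is not reproduced here.

```python
def g(alpha, diag):
    """Hypothetical concave objective: -1/2 * sum_i diag[i] * alpha[i]^2."""
    return -0.5 * sum(k * a * a for k, a in zip(diag, alpha))


def grad_g(alpha, diag):
    return [-k * a for k, a in zip(diag, alpha)]


def swap_step(alpha, diag):
    """Move weight from the worst active coordinate j* to the best coordinate i*."""
    L = max(diag)
    grad = grad_g(alpha, diag)
    n = len(alpha)
    i_star = max(range(n), key=lambda i: grad[i])
    active = [j for j in range(n) if alpha[j] > 0]
    j_star = min(active, key=lambda j: grad[j])
    diff = grad[i_star] - grad[j_star]
    lam_free = min(1.0, diff / (2 * L))    # step-size of Eqn. (87), capped at 1
    lam = min(lam_free, alpha[j_star])     # clipping keeps alpha[j*] >= 0
    kind = "SWAP-add" if lam == lam_free else "SWAP-drop"
    new_alpha = list(alpha)
    new_alpha[i_star] += lam
    new_alpha[j_star] -= lam
    return new_alpha, kind, diff


if __name__ == "__main__":
    diag = [2.0, 1.0, 4.0]
    alpha = [0.5, 0.5, 0.0]
    new_alpha, kind, diff = swap_step(alpha, diag)
    improvement = g(new_alpha, diag) - g(alpha, diag)
    # For an unclipped step with step-size below 1, Eqn. (88) guarantees
    # an improvement of at least diff^2 / (4L).
    print(kind, new_alpha, improvement, diff ** 2 / (4 * max(diag)))
```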






A.4 Iteration Complexity Bounds

Here we demonstrate Proposition 5.

Proof. The proof is very similar to that of Proposition 4, but here we employ
Lemma 5 to obtain a different bound for $g(\alpha^*) - g(\alpha_k)$. An earlier
lemma shows that for sufficiently large $k$ the iterate $\alpha_k$ produced by
the SWAP method after a SWAP-add or FW step is a $\Delta$-approximate
solution, with $\Delta = 2\sqrt{L \delta_k}$. In addition, since SWAP is
globally convergent, $\delta_k$ can be chosen to be arbitrarily small. Thus,
there exists $K_0$ large enough that for all $k \geq K_0$, if $\alpha_k$ is a
$\delta$-approximate solution, $g(\alpha^*) - g(\alpha_k) \leq N m \delta^2$.
Define $k(\delta)$ and $K(\varepsilon)$ as in the proof of Proposition 3. Let
$\varepsilon_0$ be the primal-dual gap $\Delta$ at iteration $K_0$. For
$\delta \leq \varepsilon_0$ we have

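$$
k(\delta) \leq 2 \, \frac{N m \delta^2}{(\delta/2)^2 / (4L)} = 2 \cdot 16 N m L = 32 N m L . \tag{98}
$$

The factor 2 again comes from the fact that the total number of iterations is
at most two times the number of SWAP-add and FW iterations plus a finite
constant (which has no bearing on the result). Now,

$$
\begin{aligned}
K(\varepsilon) &\leq K_0 + k(\varepsilon_0) + k\!\left(\frac{\varepsilon_0}{2}\right) + k\!\left(\frac{\varepsilon_0}{4}\right) + \ldots + k\!\left(\frac{\varepsilon_0}{2^{\lceil \log_2 (\varepsilon_0/\varepsilon) \rceil - 1}}\right) \\
&\leq K_0 + 32 N m L \left\lceil \log_2 1/\varepsilon - \log_2 1/\varepsilon_0 \right\rceil \\
&\leq K_0 + 32 N m L \left( 1 - \log_2 1/\varepsilon_0 \right) + 32 N m L \log_2 1/\varepsilon .
\end{aligned}
\tag{99}
$$

Set $M = 32 N m L$ and $Q = K_0 + 32 N m L \left( 1 - \log_2 1/\varepsilon_0 \right)$
to obtain the result. □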
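
The shape of this bound can be made concrete with a few lines of code. The
sketch below counts the halving stages needed to bring the gap from
$\varepsilon_0$ down to $\varepsilon$, charges $M = 32NmL$ iterations to each
stage, and compares the total with the closed form $Q + M \log_2 1/\varepsilon$.
All numerical values of $K_0$, $N$, $m$, $L$ and $\varepsilon_0$ are
hypothetical and chosen only for illustration.

```python
import math


def stage_bound(eps, eps0, K0, N, m, L):
    """K0 plus 32*N*m*L iterations for each halving of the gap from eps0 to eps."""
    M = 32 * N * m * L
    stages = max(0, math.ceil(math.log2(eps0 / eps)))
    return K0 + M * stages


def closed_form_bound(eps, eps0, K0, N, m, L):
    """Q + M * log2(1/eps), with M = 32*N*m*L and Q = K0 + M * (1 - log2(1/eps0))."""
    M = 32 * N * m * L
    Q = K0 + M * (1 - math.log2(1 / eps0))
    return Q + M * math.log2(1 / eps)


if __name__ == "__main__":
    params = dict(eps0=0.5, K0=10, N=1, m=100, L=1)  # hypothetical constants
    for eps in (1e-1, 1e-2, 1e-3, 1e-4):
        print(eps, stage_bound(eps, **params), closed_form_bound(eps, **params))
```

Dividing the tolerance by two adds at most $M$ iterations to the bound, which
is the logarithmic dependence on $1/\varepsilon$ expressed by the result.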